cUrling up to the net – a LuaJIT binding

I need to connect to some Azure services from Linux, using C++. There are a few C/C++ libraries around that will make life relatively easy in this regard, but I thought I’d go with an option that has me learn about something I don’t use that often, but that is fairly powerful and complete.  I chose cURL, or libcurl.so to be more precise.  Why?  cURL has been around for ages, has continued to evolve, and makes it fairly easy to do anything from ftp down/upload to https connections.  It has tons of options and knobs, including dealing with authentication, SSL, or just acting like an ordinary socket if you prefer that.

First I created a luajit binding to libcurl.  This just follows my usual learning pattern.  In order to conquer an API, you must first render it useful from Lua.  I did my typical two part binding, first a fairly faithful low level binding, then added some luajit idioms atop.  In this particular case, there’s not a big database to query, although there are quite a few options that get listed in a table:

 

CINIT(COOKIE, OBJECTPOINT, 22),
CINIT(HTTPHEADER, OBJECTPOINT, 23),
CINIT(HTTPPOST, OBJECTPOINT, 24),
CINIT(SSLCERT, OBJECTPOINT, 25),
CINIT(KEYPASSWD, OBJECTPOINT, 26),
CINIT(CRLF, LONG, 27),

This is an excerpt from the original curl.h header file. There is a listing of some 200 options which can be set on a curl connection (depending on the context of the connection). This CINIT is a macro that sets the value of an enum appropriately. Well, those macros don’t work in the luajit ffi.cdef call, so I needed a way to convert these. I could have just run the lot through the gcc pre-processor, and that would give me the values I needed, but I thought I’d take a different approach.

I wrote a bit of script to scan the curl.h file looking for the CINIT lines, and turn them into something interesting.

local function startswith(s, prefix)
    return string.find(s, prefix, 1, true) == 1
end

local function writeClean(filename)
	for line in io.lines(filename) do
		if startswith(line, "CINIT") then
			local name, tp, num = line:match("CINIT%((%g+),%s*(%g+),%s*(%d+)")
			print(string.format("\t%-25s = {'%s', %s},", name, tp, num))
		end
	end
end

writeClean(arg[1] or "curl.h")

generates…

	COOKIE                    = {'OBJECTPOINT', 22},
	HTTPHEADER                = {'OBJECTPOINT', 23},
	HTTPPOST                  = {'OBJECTPOINT', 24},
	SSLCERT                   = {'OBJECTPOINT', 25},
	KEYPASSWD                 = {'OBJECTPOINT', 26},
	CRLF                      = {'LONG', 27},

Well, that’s nice. Now I have it as a useful table. I can write another script to turn that into enums, or anything else.

local ffi = require("ffi")

local filename = arg[1] or "CurlOptions"

local options = require(filename)

ffi.cdef[[
enum {
	CURLOPTTYPE_LONG          = 0,
	CURLOPTTYPE_OBJECTPOINT   = 10000,
	CURLOPTTYPE_FUNCTIONPOINT = 20000,
	CURLOPTTYPE_OFF_T         = 30000
};
]]


local function CINIT(na,t,nu) 
	return string.format("\tCURLOPT_%s = CURLOPTTYPE_%s+%d,", na, t, nu)
end

local tbl = {}
local function addenum(name, type, number)
	table.insert(tbl, CINIT(name, type, number));
end

table.insert(tbl, "local ffi = require('ffi')");
table.insert(tbl, "ffi.cdef[[\ntypedef enum {")

for k,v in pairs(options) do
	addenum(k, v[1], v[2]);
end

table.insert(tbl, "} CURLoption;]]");

local tblstr = table.concat(tbl,'\n')
print(tblstr)
-- now get the definitions as a giant string
-- and execute it
local defs = loadstring(tblstr);
defs();

It’s actually easier than this. I threw in the loadstring just to ensure the output was valid. Yeah, OK, that covers the basics.
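Once the generated chunk has executed, those enum constants become visible through the ffi’s C namespace, just as if a compiled header had been included. A quick sanity check with a couple of hand-trimmed entries (the real file contains the full set):

```lua
local ffi = require("ffi")

ffi.cdef[[
enum {
	CURLOPTTYPE_LONG        = 0,
	CURLOPTTYPE_OBJECTPOINT = 10000
};
typedef enum {
	CURLOPT_COOKIE = CURLOPTTYPE_OBJECTPOINT + 22,
	CURLOPT_CRLF   = CURLOPTTYPE_LONG + 27
} CURLoption;
]]

-- enum constants resolve through ffi.C
print(tonumber(ffi.C.CURLOPT_COOKIE))   -- 10022
print(tonumber(ffi.C.CURLOPT_CRLF))     -- 27
```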

libcurl is an extremely capable library. As such, it can be a challenge to use. Fortunately, it has an ‘easy’ interface as well. Here, I chose to wrap the easy interface in an object-like wrapper to make it even easier. Here’s how you use it:

require("CRLEasyRequest")(url):perform();

That will retrieve the entirety of any given url, and dump the output to stdout. Well, that’s somewhat useful, and it’s only one line of code. This simple interface will become more sophisticated over time, including being the basis for REST calls, but for now, it suffices.
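For the curious, here is a minimal sketch of the shape such a wrapper can take. The libcurl ‘easy’ entry points and the CINIT-derived value for CURLOPT_URL are real; the wrapper itself is just an illustration, not the actual CRLEasyRequest code:

```lua
local ffi = require("ffi")

ffi.cdef[[
typedef void CURL;
CURL *curl_easy_init(void);
int   curl_easy_setopt(CURL *handle, int option, ...);
int   curl_easy_perform(CURL *handle);
void  curl_easy_cleanup(CURL *handle);
]]

local curl = ffi.load("curl")

local CURLOPT_URL = 10000 + 2   -- CURLOPTTYPE_OBJECTPOINT + 2, from the generated table

local EasyRequest = {}
EasyRequest.__index = EasyRequest

local function EasyRequestNew(url)
	-- tie cleanup to garbage collection of the handle
	local handle = ffi.gc(curl.curl_easy_init(), curl.curl_easy_cleanup)
	curl.curl_easy_setopt(handle, CURLOPT_URL, url)
	return setmetatable({Handle = handle}, EasyRequest)
end

function EasyRequest:perform()
	-- with no WRITEFUNCTION set, libcurl writes the body to stdout
	return curl.curl_easy_perform(self.Handle)
end

-- EasyRequestNew("http://example.com"):perform()
```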

So, there you have it. libcurl seems to be a viable choice for web access from within lua. You could of course just do it all with pure lua code, but you probably won’t go wrong leveraging libcurl.


Accessing Open vSwitch using LuaJIT – The LJIT2ovs project

I’ve recently had the need to learn Open vSwitch to do some virtualized networking stuff. As part of my learning experience, I’ve created the LJIT2ovs project on GitHub.

Open vSwitch (OVS) is a pretty decent bit of kit.  Its job is to act as that bit of bridge code on a machine that supports software defined networking.  Essentially, a network hub/switch/bridge on your machine.  My particular use case is to support interesting networking topologies with VMs, virtual LANs and the like in the context of a data center.  Having OVS allows you to do some fairly fancy things in terms of managing VM movements and virtual LANs.

I have certainly done tons of integration with libraries of various types, on platforms from the Raspberry Pi to Windows, and Linux in general.  Some code is easier to interop with, and some code is harder.  I wrote a series a couple of years back on the benefits and pitfalls of different styles of API development.

The OVS code is fairly easy to interop with using LuaJIT.  The publicly exposed stuff is all C99, and not particularly fancy.  There are a few cases where the usage of macros had me scratching my head, but a quick pass with a C compiler typically helps flesh those out.

One example will help illustrate the ease and pitfalls of doing some forms of interop.  I’ll use the UUID features in the library as an example.

First, there are the data structures:

ffi.cdef[[
struct uuid {
    uint32_t parts[4];
};
]]

The LuaJIT ffi makes pretty short work of that. Now the various routines that need a struct uuid will have access to what they need. The various functions are thus:

ffi.cdef[[
void uuid_init(void);
void uuid_generate(struct uuid *);
void uuid_zero(struct uuid *);
bool uuid_is_zero(const struct uuid *);
int uuid_compare_3way(const struct uuid *, const struct uuid *);
bool uuid_from_string(struct uuid *, const char *);
bool uuid_from_string_prefix(struct uuid *, const char *);
]]

Again, fairly simple. The use of ‘bool’ instead of ‘int’ as a return type makes life a lot easier, as the ffi will translate it to lua’s native boolean type. If it were an int, then instead of simply writing

if not uuid_from_string(id, someuuidstring) then
  print("FAIL")
end

You’d have to write

if uuid_from_string(id, someuuidstring) ~= 0 then
  print("FAIL")
end

It seems like such a minor difference, but it’s actually a pretty huge pain in the backside. When the return value is an int, there are myriad things it can mean. One of them is ‘0 == success’, another is ‘0 == failure’, and there can be various others of course, including checking errno to find out what really went wrong or right. The bool is much more straightforward: success is usually indicated with ‘true’ and failure with ‘false’.

Where things get a little more tricky and you have to pay special attention, is when you’re doing interop between different parts of the system and you flow through lua along the way.

Part of OVS has this dynamic string thing, which is one of those stretchy buffers that will grow as necessary as you add stuff to it. One of the things it has is a ‘printf’ style formatting routine.

ffi.cdef[[
void ds_put_format(struct ds *, const char *, ...);
]]

Yes, the LuaJIT ffi can deal with variadic functions, but you have to take extra care when feeding it the variable arguments. In particular, you’re responsible for dealing with the types of those arguments. This is where the care comes in. With normal functions (non-variadic), the ffi marshaling will just take care of things, and you don’t even realize a “number” is being turned into an ‘int32_t’, or whatever it needs to be. The ffi just figures it out based on the signature, and does the right thing. Not so with variadics. It can’t know what you need to pass.
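A small illustration, using C’s own printf rather than anything from OVS (it is simply the most familiar variadic function around):

```lua
local ffi = require("ffi")

ffi.cdef[[
int printf(const char *fmt, ...);
]]

-- A Lua number defaults to 'double' when passed through '...',
-- so an integer format specifier needs an explicit cast:
ffi.C.printf("%d\n", ffi.new("int", 42))
ffi.C.printf("%g\n", 3.14)            -- doubles pass through as-is
ffi.C.printf("%s\n", "hello")         -- Lua strings convert to const char *
```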

So, here’s more from the UUID routines, some macros.

#define UUID_LEN 36
#define UUID_FMT "%08x-%04x-%04x-%04x-%04x%08x"
#define UUID_ARGS(UUID)                             \
    ((unsigned int) ((UUID)->parts[0])),            \
    ((unsigned int) ((UUID)->parts[1] >> 16)),      \
    ((unsigned int) ((UUID)->parts[1] & 0xffff)),   \
    ((unsigned int) ((UUID)->parts[2] >> 16)),      \
    ((unsigned int) ((UUID)->parts[2] & 0xffff)),   \
    ((unsigned int) ((UUID)->parts[3]))

In general, macros are of two types. First are the simple #defines, which can turn into local variables. In this particular case, UUID_LEN and UUID_FMT fit the bill. UUID_LEN is just a numeric constant, so it can become a regular number. UUID_FMT doesn’t use any particularly fancy format specifiers (luckily lua’s string.format is similar enough in most cases), so it can become a simple string.

local UUID_LEN = 36
local UUID_FMT = "%08x-%04x-%04x-%04x-%04x%08x"
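Conveniently, lua’s string.format understands these same printf-style specifiers, so UUID_FMT can be used directly with plain Lua numbers:

```lua
local UUID_FMT = "%08x-%04x-%04x-%04x-%04x%08x"

print(string.format(UUID_FMT,
	0x12345678, 0x1234, 0x5678, 0x1234, 0x5678, 0x12345678))
-- 12345678-1234-5678-1234-567812345678
```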

But what about that macro? Can we just turn it into a function, strip out the naughty bits, and call it a day?

local function UUID_ARGS(UUID)                             
    return ( ((UUID).parts[0])),            
    ( rshift((UUID).parts[1], 16)),      
    ( band((UUID).parts[1], 0xffff)),   
    ( rshift((UUID).parts[2], 16)),      
    ( band((UUID).parts[2], 0xffff)),   
    ( ((UUID).parts[3]))
end

Well, I want to write the following:

local id1 = uuid.uuid();
local success = uuid.uuid_from_string_prefix(id1, "12345678-1234-5678-1234-5678123456781234")
local output = ds.dynamic_string();
ds.ds_put_format(output, uuid.UUID_FMT, uuid.UUID_ARGS(id1))

print("OUTPUT: ", ds.ds_cstr(output));

Basically just round trip a string into a UUID and back out into a stretchy buffer, and print it as a lua string. Does it work? Nope, not quite. In this case, the output would be:

00000000-0000-0000-0000-45e4de9000000000

Not quite what I was expecting. Head scratch. OK, well, obviously I was a bit simplistic in my conversion of that UUID_ARGS macro. Maybe I should have preserved that casting stuff.

local uint32_t = ffi.typeof("uint32_t")

local function UUID_ARGS(UUID)                             
    local p1 = uint32_t(UUID.parts[0])
    local p2 = uint32_t(rshift(UUID.parts[1], 16))
    local p3 = uint32_t(band(UUID.parts[1], 0xffff))
    local p4 = uint32_t(rshift(UUID.parts[2], 16))
    local p5 = uint32_t(band(UUID.parts[2], 0xffff))
    local p6 = uint32_t(UUID.parts[3])

    return p1, p2, p3, p4, p5, p6
end

In addition to the temporary variables (used for debugging), I stripped out a number of parentheses, because lua lives better with fewer parens. Basically, I used the ‘uint32_t’ type in a way that is similar to a typecast. Of course, it’s actually worse the way I’ve used it here, because each conversion creates a temporary cdata value, which needs to be garbage collected, but the result is now:

OUTPUT: 12345678-1234-5678-1234-567812345678

Which is what I was expecting. So, maybe one last pass to clean it up…

local function UUID_ARGS(UUID)                             
    return ffi.cast("unsigned int", (UUID.parts[0])),
        ffi.cast("unsigned int", rshift(UUID.parts[1], 16)),
        ffi.cast("unsigned int", band(UUID.parts[1], 0xffff)),
        ffi.cast("unsigned int", rshift(UUID.parts[2], 16)),
        ffi.cast("unsigned int", band(UUID.parts[2], 0xffff)),
        ffi.cast("unsigned int", UUID.parts[3])
end

So, essentially, a more faithful representation of the original. Note though: although this works with UUID_FMT when fed to C-based ‘printf’ style functions, it will NOT work with the lua string.format routine, which will complain that it was expecting a number but received a ‘cdata’ instead. So, you have to think fairly clearly about how those APIs and macros are going to be used, and therefore how they should be translated.

Here’s that uuid.lua file in its entirety:

local ffi = require("ffi")
local bit = require("bit")

local rshift, band = bit.rshift, bit.band

local function BUILD_ASSERT_DECL(...)
    assert(...);
end

local UUID_BIT = 128;            -- Number of bits in a UUID.
local UUID_OCTET = (UUID_BIT / 8); -- Number of bytes in a UUID.

ffi.cdef[[
/* A Universally Unique IDentifier (UUID) compliant with RFC 4122.
 *
 * Each of the parts is stored in host byte order, but the parts themselves are
 * ordered from left to right.  That is, (parts[0] >> 24) is the first 8 bits
 * of the UUID when output in the standard form, and (parts[3] & 0xff) is the
 * final 8 bits. */
struct uuid {
    uint32_t parts[4];
};
]]

BUILD_ASSERT_DECL(ffi.sizeof("struct uuid") == UUID_OCTET);


local UUID_LEN = 36;
local UUID_FMT = "%08x-%04x-%04x-%04x-%04x%08x";

local function UUID_ARGS(UUID)                             
    return ffi.cast("unsigned int", (UUID.parts[0])),
        ffi.cast("unsigned int", rshift(UUID.parts[1], 16)),
        ffi.cast("unsigned int", band(UUID.parts[1], 0xffff)),
        ffi.cast("unsigned int", rshift(UUID.parts[2], 16)),
        ffi.cast("unsigned int", band(UUID.parts[2], 0xffff)),
        ffi.cast("unsigned int", UUID.parts[3])
end


--[[
/* Returns a hash value for 'uuid'.  This hash value is the same regardless of
 * whether we are running on a 32-bit or 64-bit or big-endian or little-endian
 * architecture. */
--]]
local function uuid_hash(uuid)
    return uuid.parts[0];
end

-- Returns true if 'a == b', false otherwise.
local function uuid_equals(a, b)

    return (a.parts[0] == b.parts[0]
            and a.parts[1] == b.parts[1]
            and a.parts[2] == b.parts[2]
            and a.parts[3] == b.parts[3]);
end


  
ffi.cdef[[
void uuid_init(void);
void uuid_generate(struct uuid *);
void uuid_zero(struct uuid *);
bool uuid_is_zero(const struct uuid *);
int uuid_compare_3way(const struct uuid *, const struct uuid *);
bool uuid_from_string(struct uuid *, const char *);
bool uuid_from_string_prefix(struct uuid *, const char *);
]]

local Lib_uuid = ffi.load("openvswitch")

-- initialize uuid routines
Lib_uuid.uuid_init();

local exports = {
    Lib_uuid = Lib_uuid;

    UUID_BIT = UUID_BIT;
    UUID_OCTET = UUID_OCTET;
    UUID_FMT = UUID_FMT;
    UUID_ARGS = UUID_ARGS;

    -- types
    uuid = ffi.typeof("struct uuid");

    -- inline routines
    uuid_hash = uuid_hash;
    uuid_equals = uuid_equals;

    -- library functions
    uuid_init = Lib_uuid.uuid_init;
    uuid_generate = Lib_uuid.uuid_generate;
    uuid_zero = Lib_uuid.uuid_zero;
    uuid_is_zero = Lib_uuid.uuid_is_zero;
    uuid_compare_3way = Lib_uuid.uuid_compare_3way;
    uuid_from_string = Lib_uuid.uuid_from_string;
    uuid_from_string_prefix = Lib_uuid.uuid_from_string_prefix;
}

return exports

Just a couple more notes on style. When I’m using such a binding, I’ll typically do the following:

local uuid = require("uuid")

From there, I will want to utilize a simple notation to get at things.

local id = uuid.uuid(); 
uuid.uuid_generate(id);

local output = ds.dynamic_string();
ds.ds_put_format(output, uuid.UUID_FMT, uuid.UUID_ARGS(id))

 

I want to minimize the amount of times I have to use the ffi routines directly because they’re fairly low level, and can lead to some very subtle errors if you’re not careful. So, a properly constructed module will wrap up as much as possible.

In some cases I’ll create a ‘class’, which is a table to encapsulate the various bits and related bytes. In this case I did not, but I might still do it later if I find it useful.

I will typically do the ffi.load to get a handle on the library from which the routines are called, and stick that reference into a table, so the library reference won’t be garbage collected, and therefore possibly unloaded from the running system. And lastly, the function names have this construct:

    uuid_generate = Lib_uuid.uuid_generate;

It’s a simple aliasing, so that I don’t have to remember which library the function is located in; I can just call uuid.uuid_generate, and the right thing will happen. This isn’t the most performant way to do it, as the compiler can’t optimize a call through the table indirection as well, so it will be slower. These routines are not typically called in a hot loop, so it’s an OK tradeoff. If you were being really hard core and needed the extra cycles, you’d bind the library function to a local and call it directly, instead of going through the table indirection.

The last trick you can do is to import that exports table into the global namespace:

for k,v in pairs(uuid) do
    _G[k] = v;
end

If you do this, then your code can drop the table indirection and become simply:

local id = uuid(); 
uuid_generate(id);

local output = dynamic_string();
ds_put_format(output, UUID_FMT, UUID_ARGS(id))

This is looking pretty close to the original ‘C’ code. Fewer pointers to deal with, no memory allocation problems, no funky difference between ‘.’ and ‘->’ indirection. In this way, Lua is just a nicer version of ‘C’ 🙂 My coworkers will get a kick out of that sentiment.

And so, there you have it. A fairly small example of beginning to interop with OVS. Open vSwitch is a fairly large codebase. It’s split between simple base library routines, reporting, core packet handling, and higher level apps, daemons and utilities. Starting from the top, the utilities are fairly easy to replace, once the interop at the lowest levels is dealt with. In LJIT2ovs, the lower levels are fairly well covered, and coverage increases as needs dictate. At the higher level, I am beginning with replacing simple routines, such as querying the database. At the end of the day, these higher level utilities are much easier to write, maintain, expand, and incorporate when they are written in script, rather than C.

Next time around I’ll examine some of these routines, how they can be rewritten, and what benefits might accrue from that exercise.


When Scripts Roamed the Earth

Way way back in the day, I played with Tcl.  What a nice little compact thing that was.  Then along came this thing called Python.  Kind of funky with its indentation thing, but wow, what it has become!  I was cutting my CS chops when ‘p-code’ meant something.  Then along came this Javascript thing.  For the longest time, I think it kind of puttered along, until BAM!  The internet exploded, and more recently node.js happened.  Now suddenly it’s become a ‘de-facto’ go-to language of the day.

But, another thing has happened recently as well.  With the V8 javascript compiler comes JIT compilation.  Then along comes Lua, and Go, and Python again, and suddenly ‘script’ is becoming as fast, if not faster, than statically compiled ‘C’, which has been the mainstay of computer programming for a few decades now.

And now, two other things are happening.  LuaJIT has this thing called dynasm.  This Dynamic Assembler quickly turns what looks like embedded assembly instructions into actual machine instructions at runtime.  This is kind of different from what nasm does.  Nasm is an assembler proper.  It takes assembly instructions and turns them into machine specific code, as part of a typical ‘compile/link/run’ chain.  Dynasm just generates a function in memory, and then you can call it directly, while your program is running.

This concept of dynamic machine code generation seems to be a spreading trend, and all JIT runtimes do it.  I just came across another tool that helps you embed such a JIT thing into your C++ code.  Asmjit is the tool that does a thing similar to what luajit’s dynasm does.

These of course are not unique, and I’m sure there are countless projects that can be pointed to that do something somewhat similar.  And that’s kind of the point.  This dynamic code generation and execution thing is rapidly leaving the p-code phase, and entering the direct machine execution phase, which is making dynamic languages all the more usable and performant.

So, what’s next?

Well, that got me to thinking.  If really fast code can be delivered and executed at runtime, what kinds of problems can be solved?  Remote code execution is nothing new.  There are always challenges with marshaling, versioning, different architectures, security, and the like.  Some of the problems that exist are due to the typically static nature of the code that is being executed on both ends.  Might things change if both ends are more dynamic?

Take the case of TLS/SSL.  There’s all these certificate authorities, which is inherently fragile and error prone.  Then there’s the negotiation of the highest common denominator parameters for the exchange of data.  Well, what if this whole mess were given over to a dynamic piece?  Rather than negotiating the specifics of the encryption mechanism, the two parties could simply negotiate and possibly transfer a chunk of code to be executed.

How can that work?  The client connects to the server, using some mechanism to identify itself (possibly anonymous, possibly this is handled higher up in the stack).  The server then sends a bit of code that the client will use to pass through every chunk of data that’s headed to the server.  Since the client has dynasm embedded, it can compile that code and continue operating.  Whoever wrote the client doesn’t know anything about the particulars of communicating with the server.  They didn’t mess up the cryptography, they didn’t have to keep up to date with the latest Heartbleed.  The server can change and customize the exchange however it sees fit.

The worst case scenario is that the parties cannot agree on anything interesting, so they fall back to using plain old TLS.  This seems useful to me.  A lot of code, that has a high probability of being done wrong, is eliminated from the equation.  If certificate authorities are desired, then they can be used.  If something more interesting is desired, it can easily be encoded and shared.  If thing need to change instantly, it’s just a change on the server side, and move along.

Of course each side needs to provide an appropriate sandbox so the code doesn’t just execute something arbitrary.  Each side also needs to provide some primitives, like ability to grab certificates if needed, and access to crypto libraries if needed.

If the server wants to use a non-centralized form of identity, it can just code that up, and be on its way.  The potential is high for extremely dynamic communications, as well as mischief.

And what else?  Well, I guess just about anything that can benefit from being dynamic.  Learning new gestures, voice recognition, image recognition, learning to walk, learning new algorithms for searching, sorting, filtering, etc.  Just about anything.

Following this line of reasoning, I’d expect my various machines to start talking with each other using protocols of their own making.  Changing dynamically to fit whatever situation they encounter.  The communications algorithms go meta.  We need algorithms to create algorithms.  Threats and intrusions are perceived, and dealt with dynamically.  No waiting for a rev of the OS, no centrally distributed patches, no worrying about incompatible versions of this that and the other thing.  The machines, and their communications, become individual, dynamic, and non-static.

This could be interesting.

 


Fast Apps, Microsoft Style

Pheeeuuww!!

That’s what I exclaimed at least a couple of times this morning as I sat at a table in a makeshift “team room” in building 43 at Microsoft’s Redmond campus. What was the exclamation for? Well, over the past 3 months, I’ve been working on a quick strike project with a new team, and today we finally announced our “Public Preview”.  Or, if you want to get right to the product: Cloud App Discovery

I’m not a PM or marketing type, so it’s best to go and read the announcement for yourself if you want to get the official spiel on the project.  Here, I want to write a bit about the experience of coming up with a project, in short order, in the new Microsoft.

It all started back in January for me.  I was just coming off another project, and casting about for the next hardest ‘mission impossible’ to jump on.  I had a brief conversation with a dev manager who posed the question: “Is it possible to reestablish the ‘perimeter’ for IT guys in this world of cloud computing?”  An intriguing question.  The basic problem was, if you go to a lot of IT guys, they can barely tell you how many of the people within their corporation are using SalesForce.com, let alone DropBox from a cafe in Singapore.  Forget the notion of even trying to control such access.  The corporate ‘firewall’ is almost nothing more than a quartz space heater at this point, preventing very little, and knowing about even less.

So, with that question in mind, we laid out 3 phases of development.  Actually, they were already laid out before I joined the party (by a couple of weeks), so I just heard the pitch.  It was simple: the first phase of development is to see if we can capture network traffic, using various means, and project it up to the cloud, where we could use some machine learning to give an admin a view of what’s going on.

Conveniently sidestepping any objections actual employees might have with this notion, I got to thinking on how it could be done.

For my part, we wanted to have something sitting on the client machine (a windows machine that the user is using), which will inspect all network traffic coming and going, and generate some reports to be sent up to the cloud.  Keep in mind, this is all consented activity, the employee gets to opt in to being monitored in this way.  All in the open and up front.

At the lowest level, my first inclination was to use a raw socket to create a packet sniffer, but Windows has a much better solution these days, built for exactly this purpose.  The Windows Filter Platform, allows you to create a ‘filter’ which you can configure to callout to a function whenever there is traffic.  My close teammate implemented that piece, and suddenly we had a handle on network packets.

We fairly quickly decided on an interface between that low level packet sniffing, and the higher level processor.  It’s as easy as this:

 

int WriteBytes(char *buff, int bufflen);
int ReadBytes(char *buff, int bufflen, int &bytesRead);

I’m paraphrasing a bit, but it really is that simple. What’s it do? Well, the fairly raw network packets are sent into ‘WriteBytes’, some processing is done, and a ‘report’ becomes available through ‘ReadBytes’. The reports are a JSON formatted string which then gets turned into the appropriate thing to be sent up to the cloud.

The time it took from hearing about the basic product idea, to a prototype of this thing was about 3 weeks.

What do I do once I get network packets? Well, the network packets represent a multiplexed stream of packets, as if I were a NIC. All incoming, outgoing, all TCP ports. Once I receive some bytes, I have to turn it back into individual streams, then start doing some ‘parsing’. Right now we handle http and TLS. For http, I do full http parsing, separating out headers, reading bodies, and the like. I did that by leveraging the http parsing work I had done for TINN already. I used C++ in this case, but it’s all relatively the same.

TLS is a different story. At this ‘discovery’ phase, it was more about simple parsing. So, reading the record layer, decoding client_hello and server_hello, certificate, and the like. This gave me a chance to implement TLS processing using C++ instead of Lua. One of the core components that I leveraged was the byte order aware streams that I had developed for TINN. That really is the crux of most network protocol handling. If you can make heads or tails of what the various RFCs are saying, it usually comes down to doing some simple serialization, and getting the byte ordering right is the hardest part. 24-bit big endian integers, anyone?

At any rate, http parsing, fairly quick. TLS client_hello, fast enough, although properly handling the extensions took a bit of time. At this point, we’d be a couple months in, and our first partners get to start kicking the tires.

For such a project, it’s very critical that real world customers are involved really early, almost sitting in our design meetings. They course corrected us, and told us what was truly important and annoying about what we were doing, right from day one.

From the feedback, it becomes clear that getting more information, like the amount of traffic flowing through the pipes is as interesting as the meta information, so getting the full support for flows becomes a higher priority. For the regular http traffic, no problem. The TLS becomes a bit more interesting. In order to deal with that correctly, it becomes necessary to suck in more of the TLS implementation. Read the server_hello, and the certificate information. Well, if you’re going to read in the cert, you might as well get the subject common name out so you can use that bit of meta information. Now comes ASN.1 (DER) parsing, and x509 parsing. That code took about 2 weeks, working “nights and weekends” while the other stuff was going on. It took a good couple of weeks not to integrate, but to write enough test cases, with real live data, to ensure that it was actually working correctly.

The last month was largely a lot of testing, making sure corner cases were dealt with and the like. As the client code is actually deployed to a bunch of machines, it really needed to be rock solid, no memory leaks, no excessive resource utilization, no CPU spiking, just unobtrusive, quietly getting the job done.

So, that’s what it does.

Now, I’ve shipped at Microsoft for numerous years. The fastest cycles I’ve usually dealt with are on the order of 3 months. That’s usually for a product that’s fairly mature, has plenty of engineering system support, and a well laid out roadmap. Really you’re just turning the crank on an already laid out plan.

This AppDiscovery project has been a bit different. It did not start out with a plan that had a 6 month planning cycle in front of it. It was a hunch that we could deliver customer value by implementing something that was challenging enough, but achievable, in a short amount of time.

So, how is this different than Microsoft of yore? Well, yes, we’ve always been ‘customer focused’, but this is to the extreme. I’ve never had customers this involved in what I was doing this early in the development cycle. I mean literally, before the first prototypical bits are even dry, the PM team is pounding on the door asking “when can I give it to the customers?”. That’s a great feeling actually.

The second thing is how much process we allowed ourselves to use. Recognizing that it’s a first run, and recognizing that customers might actually say “mehh, not interested”, it doesn’t make sense to spin up the classic development cycle which is meant to maintain a product for 10-14 years. A much more streamlined lifecycle which favors delivering quality code and getting customer feedback, is what we employed. If it turns out that customers really like the product, then there’s room to fit the cycle to a cycle that is more appropriate for longer term support.

The last thing that’s special is the amount of leveraging Open Source we are allowing ourselves these days. Microsoft has gone full tilt on OpenSource support. I didn’t personally end up using much myself, but we are free to use it elsewhere (with some legal guidelines). This is encouraging, because for crypto, I’m looking forward to using things like SipHash, and ChaCha20, which don’t come natively with the Microsoft platform.

Overall, as Microsoft continues to evolve and deliver ‘customer centric’ stuff, I’m pretty excited and encouraged that we’ll be able to use this same model time and again to great effect. Microsoft has a lot of smart engineers. Combined with some new directives about meeting customer expectations at the market, we will surely be cranking out some more interesting stuff.

I’ve implemented some interesting stuff while working on this project, some of it I’ll share here.


Revisiting C++

I was a C++ expert twice in the past. The first time around was because I was doing some work for Taligent, and their whole operating system was written in C++. With that system I got knee deep into the finer details of templates, and exceptions, to a degree that will likely never be seen on the planet earth.

The second time around, was because I was programming on the BeOS. Not quite as crazy as the Taligent experience, but C/C++ were all the rage.

Then I drifted into Microsoft, and C# was born. For the past 15 years, it’s been a slow rise to dominance with C# in certain quarters of Microsoft. It just so happens that this corresponds to the rise of the virus attacks on Windows, as well as the shift in programming skills of college graduates. In the early days of spectacular virus attacks, you could attribute most of them to buffer overruns, which allowed code to run on the stack. This was fairly easily plugged by C#, and security coding standards.

Today, I am working on a project where once again I am learning C++. This time around it’s C++ 11, which is decidedly more mature than the C++ I learned while working on Taligent. It’s not so dramatically different as say the difference between Lisp and Cobol, but it gained a lot of stuff over the years.

I thought I would jot down some of the surface differences I have noticed since I’ve been away.

First, to compare C++ to Lua, there are some surface differences. Most of the languages I program in today have their roots in Algol, so they largely look the same. But there are some simple dialect differences. C++ is full of curly braces ‘{}’, semi-colons ‘;’, and parentheses ‘()’. Oh my god with the parens and semis!! With Lua, parens are often optional, semis are optional, and instead of curlies there are ‘do’/‘end’ pairs, or simply ‘end’. For loops are different, array indices start at 1 (unless you’re doing interop with the FFI), and do/while is repeat/until.

These are all minor differences, like say the differences between Portuguese and Spanish. You can still understand the other if you speak one. Perhaps not perfectly, but there is a relatively easy translation path.

Often times in language wars, these are the superficial differences that people talk about. Meh, not interesting enough to drive me one way or another.

But then, there’s this other stuff, which is truly the essence of the differences: strong typing vs duck typing, managed memory, dynamic code execution. I say ‘Lua’ here, but really that could be a stand-in for C#, node.js, Python, or Ruby. Basically, there is a set of modern languages which exhibit a similar set of features, which are different enough from C/C++ that there is a difference in the programming models.

To illustrate, here’s a bit of C++ code that I have written recently. The setup is this: I receive a packet of data, typically the beginning of an HTTP conversation. From that packet of data, I must be able to ‘parse’ the thing, determine whether it is http/https, pull out headers, etc. I need to build a series of in-place parsers, which keep the amount of memory allocated to a minimum, and work fairly quickly. So, the first piece is this thing called an AShard_t:

#pragma once

#include "anl_conf.h"

class  DllExport AShard_t  {
public:
	uint8_t *	m_Data;
	size_t	m_Length;
	size_t	m_Offset;

	// Constructors
	AShard_t();
	AShard_t(const char *);
	AShard_t(uint8_t *data, size_t length, size_t offset);

	// Virtual Destructor
	virtual ~AShard_t() {};

	// type cast
	operator uint8_t *() {return getData();}

	// Operator Overloads
	AShard_t & operator= (const AShard_t & rhs);

	// Properties
	uint8_t *	getData() {return &m_Data[m_Offset];};
	size_t		getLength() {return m_Length;};

	// Member functions
	AShard_t &	clear();
	AShard_t &	first(AShard_t &front, AShard_t &rest, uint8_t delim) const;
	bool		indexOfChar(const uint8_t achar, size_t &idx) const;
	bool		indexOfShard(const AShard_t &target, size_t &idx);
	bool 		isEmpty() const;
	void		print() const;
	bool		rebase();
	char *		tostringz() const;
	AShard_t &	trimfrontspace();

};

OK, so it’s actually a fairly simple data structure. Assuming you have a buffer of data, a shard is just a pointer into that buffer. It contains the pointer, an offset, and a length. You might say that the pointer/offset combo is redundant; you probably don’t need both. The offset could be eliminated, assuming the pointer is always at the base of the structure. But there might be a design choice that makes this useful later.

At any rate, there’s a lot going on here for such a simple class. First of all, there’s that ‘#pragma once’ at the top. Ah yes, good ol’ C preprocessor, which needs to be told not to load stuff it’s already loaded before. Then there’s class vs struct, not to be confused with ‘typedef struct’. Public/Protected/Private, copy constructor or ‘operator=’. And heaven forbid you forget to make a public default constructor. You will not be able to create an array of these things without it!

These are not mere dialectal differences; these are the differences between Spanish and Hungarian. You MUST know about the default constructor thing, or things just won’t work.

As far as implementation is concerned, I did a mix of things here, primarily because the class is so small. I’ve inserted some simple “string” processing right into the class, because I found it to be constantly useful. ‘first’, ‘indexOfChar’, and ‘indexOfShard’ turn out to be fairly handy when you’re trying to parse through something in place. ‘first’ is like in Lisp: get the first element off the list of elements. In this case you can specify a single character delimiter. ‘indexOfChar’ is like the strchr() function from C, except in this case it’s aware of the length, and it doesn’t assume a ‘null’ terminated string. ‘indexOfShard’ is like ‘strstr’, or ‘strpbrk’. With these in hand, you can do a lot of ‘tokenizing’.
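To make the behavior concrete, here is a minimal sketch of how first() and indexOfChar() might be implemented. This is my reading of the interface above, using a slimmed-down stand-in struct rather than the real AShard_t, so treat it as illustrative only:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical stand-in for AShard_t, just enough to show the
// tokenizing logic; the real class lives in the project headers.
struct Shard {
    const uint8_t *data;
    size_t length;
    size_t offset;

    const uint8_t *ptr() const { return data + offset; }

    // Like strchr(), but length-aware: no NUL terminator assumed.
    bool indexOfChar(uint8_t c, size_t &idx) const {
        for (size_t i = 0; i < length; ++i) {
            if (ptr()[i] == c) { idx = i; return true; }
        }
        return false;
    }

    // Split at the first delimiter: 'front' gets everything before it,
    // 'rest' everything after.  If the delimiter is absent, 'front' is
    // the whole shard and 'rest' comes back empty -- no crash, which is
    // what makes chained tokenizing robust.
    void first(Shard &front, Shard &rest, uint8_t delim) const {
        size_t idx;
        if (indexOfChar(delim, idx)) {
            front = { data, idx, offset };
            rest  = { data, length - idx - 1, offset + idx + 1 };
        } else {
            front = *this;
            rest  = { data, 0, offset + length };
        }
    }
};
```

Note that splitting never allocates: both output shards just point back into the original buffer, which is the whole point of the design.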

Here’s an example of parsing a URL:

bool parseUrl(const AShard_t &uriShard)
{
  AShard_t shard = uriShard;
  AShard_t rest;
	
  AShard_t scheme;
  AShard_t url;
  AShard_t authority;
  AShard_t hostname;
  AShard_t port;
  AShard_t resquery;
  AShard_t resource;
  AShard_t query;

  // http:
  shard.first(scheme, rest, ':');

  // the 'rest' represents the resource, which 
  // includes the authority + query
  // so try and separate authority from query if the 
  // query part exists
  shard = rest;
  // skip past the '//'
  shard.m_Offset += 2;
  shard.m_Length -= 2;

  // Now we have the url separated from the scheme
  url = shard;

  // separate the authority from the resource based on '/'
  url.first(authority, rest, '/');
  resquery = rest;

  // Break the authority into host and port
  authority.first(hostname, rest, ':');
  port = rest;

  // Back to the resource.  Split it into resource/query
  parseResourceQuery(resquery, resource, query);


  // Print the shards
  printf("URI: "); uriShard.print();
  printf("  Scheme: "); scheme.print();
  printf("  URL: "); url.print();
  printf("    Authority: "); authority.print();
  printf("      Hostname: "); hostname.print();
  printf("      Port: "); port.print();
  printf("    Resquery: "); resquery.print();
  printf("      Resource: "); resource.print();
  printf("      Query: "); query.print();
  printf("\n");

  return true;
}

AShard_t url0("http://www.sharmin.com:8080/resources/gifs/bunny.gif?user=willynilly&password=funnybunny");
parseUrl(url0);

Of course, I’m leaving out error checking, but even for this simple tokenization, it’s fairly robust because in most cases, if a ‘first’ fails, you’ll just get an empty ‘rest’, but definitely not a crash.

So, how does this fare against my beloved LuaJIT? Well, at this level things are about the same. In Lua, I could create exactly the same structure, using a table, and perform exactly the same operations. Only, if I wanted to do it without using the ffi, I’d have to stuff the data into a Lua string object (which causes a copy), then use the Lua string functions, count from 1, etc. Totally doable, and probably fairly optimized. There is a bit of a waste though, because in Lua, everything interesting is represented by a table, so that’s a much bigger data structure than this simple AShard_t. It’s bigger in terms of memory footprint, and it’s probably slower in execution because it’s a generalized data structure that can serve many wonderful purposes.

For memory management, at this level of structure, things are relatively easy. Since the shard does not copy the data, it doesn’t actually do any allocations, so there’s relatively little to clean up. The most common use case for shards is that they’ll either be stack based, or they’ll be stuffed into a data structure. In either case, their lifetime is fairly short and well managed, so memory management isn’t a big issue. If they are dynamically allocated, then of course there’s something to be concerned with.

Well, that touches the tip of the iceberg. I’ve re-attached to C++, and so far the gag reflex hasn’t driven me insane, so I guess it’s ok to continue.

Next, I’ll explore how insanely great the world becomes when shards roam the earth.


Asynchronous DNS lookups on Windows

I began this particular journey because I wanted to do DNS lookups asynchronously on Windows. There is of course a function for that:

DnsQueryEx

The problem I ran into is that unlike the various other Windows functions I’ve done with async, this one does not use an IO Completion Port. Instead it uses a mechanism called APC (Asynchronous Procedure Call). With this little bit of magic, you pass in a pointer to a function which will be called in your thread context, kind of in between when other things are happening. Well, given the runtime environment I’m using, I don’t think this quite works out. Basically, I’d have a function being called leaving the VM in an unknown state.

So, I got to digging. I figured, how hard can it be to make calls to a DNS server directly? After all, it is nothing more than a network based service with a well known protocol. Once I could make a straight networking call, then I could go back to leveraging IO Completion Ports just like I do for all other IO.

You can view the DNS system as nothing more than a database to which you pose queries. You express your queries using some nice well defined protocol, which is ancient in origin, and fairly effective considering how frequently DNS queries are issued. Although I could man up and write the queries from scratch, Windows helps me here by providing functions that will format the query into a buffer for me.
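For the curious, the buffer those helper functions produce follows the standard DNS wire format from RFC 1035: a 12-byte header followed by the question, with the name encoded as length-prefixed labels. A hand-rolled sketch of that encoding might look like this (purely illustrative; the Windows helper does the equivalent for me):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append a big-endian 16-bit value, as the DNS protocol requires.
static void put16(std::vector<uint8_t> &buf, uint16_t v) {
    buf.push_back(uint8_t(v >> 8));
    buf.push_back(uint8_t(v & 0xff));
}

// Build a single-question DNS query: 12-byte header, then the name
// as length-prefixed labels, then QTYPE and QCLASS (IN = 1).
std::vector<uint8_t> buildQuery(uint16_t id, const std::string &name, uint16_t qtype) {
    std::vector<uint8_t> buf;
    put16(buf, id);       // transaction ID, echoed back in the response
    put16(buf, 0x0100);   // flags: standard query, recursion desired
    put16(buf, 1);        // QDCOUNT: one question
    put16(buf, 0);        // ANCOUNT
    put16(buf, 0);        // NSCOUNT
    put16(buf, 0);        // ARCOUNT

    // "www.bing.com" becomes 3 www 4 bing 3 com 0
    size_t start = 0;
    while (start < name.size()) {
        size_t dot = name.find('.', start);
        if (dot == std::string::npos) dot = name.size();
        buf.push_back(uint8_t(dot - start));
        buf.insert(buf.end(), name.begin() + start, name.begin() + dot);
        start = dot + 1;
    }
    buf.push_back(0);     // root label terminates the name

    put16(buf, qtype);    // e.g. 1 = A, 5 = CNAME, 15 = MX
    put16(buf, 1);        // QCLASS: IN
    return buf;
}
```

That 30-odd byte blob is the whole query; the transaction ID up front is what lets you match responses to requests on a shared channel.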

But, before I get into that, what do the queries look like? What am I looking up? Well, a Domain Name Server serves up translations of names to other names and numbers. For example, I need to find the IP address of http://www.bing.com. I can look for CNAME records (aliases), or ‘A’ records (direct to an IP address). This gets esoteric and confusing, so a little code can help:

-- Prepare the DNS request
local dwBuffSize = ffi.new("DWORD[1]", 2048);
local buff = ffi.new("uint8_t[2048]")

local wID = clock:GetCurrentTicks() % 65536;
        
local res = windns_ffi.DnsWriteQuestionToBuffer_UTF8( 
  ffi.cast("DNS_MESSAGE_BUFFER*",buff), 
  dwBuffSize, 
  ffi.cast("char *",strToQuery), 
  wType, 
  wID, 
  true )

DnsWriteQuestionToBuffer_UTF8 is the Windows function which helps me to write a DNS query into a buffer, which will then be sent to the actual dns server.

wType represents the type of record you want to be returned. The values might be something like:

wType = ffi.C.DNS_TYPE_A
wType = ffi.C.DNS_TYPE_MX  -- mail records
wType = ffi.C.DNS_TYPE_CNAME

There are about a hundred different types that you can query for. The vast majority of the time though, you’re either looking for ‘A’ or ‘CNAME’ records.

The wID is just a unique identifier for the particular query so that if you’re issuing several on the same channel, you can check the response to ensure they match up.

OK. Now I have a DNS query stuffed into a buffer, how do I make the query and get the results?

-- Send the request.
local IPPORT_DNS = 53;
local remoteAddr = sockaddr_in(IPPORT_DNS, AF_INET);
remoteAddr.sin_addr.S_addr = ws2_32.inet_addr( "209.244.0.3");

-- create the UDP socket
local socket, err = NativeSocket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );

-- send the query
local iRes, err = socket:sendTo(
  remoteAddr, ffi.sizeof(remoteAddr),
  buff, dwBuffSize[0]);

This little bit of code shows the socket creation, and the actual ‘sendTo’ call. Of note, the “209.244.0.3” represents the IP address of a well known public DNS server. In this case it is hosted by Level 3, which is an internet services provider. There are of course calls you can make to figure out which DNS server your machine is typically configured to use, but this way the query will always work, no matter which network you are on.

Notice the socket is a UDP socket.

At this point, we’re already running cooperatively due to the fact that within TINN, all IO is done cooperatively, without the programmer needing to do much special.

Now to receive the query response back:

   -- Try to receive the results
    local RecvFromAddr = sockaddr_in();
    local RecvFromAddrSize = ffi.sizeof(RecvFromAddr);
    local cbReceived, err = self.Socket:receiveFrom(RecvFromAddr, RecvFromAddrSize, buff, 2048);

Basically just wait for the server to send back a response. Of course, like the sendTo, the receiveFrom works cooperatively, so that if the developer issues several ‘spawn’ commands, each query could be running in its own task, working cooperatively.

Once you have the response, you can parse out the results. The results come back as a set of records. There are of course functions which will help you to parse these records out. The key here is that the record type is indicated, and it’s up to the developer to pull out the relevant information.

The complete DNSNameServer class is here:

local ffi = require("ffi")

local Application = require("Application")
local windns_ffi = require("windns_ffi")
local NativeSocket = require("NativeSocket")
local ws2_32 = require("ws2_32")
local Stopwatch = require("StopWatch")

local clock = Stopwatch();

-- DNS UDP port
local IPPORT_DNS = 53;

local DNSNameServer = {}
setmetatable(DNSNameServer, {
    __call = function(self, ...)
        return self:create(...)
    end,
})
local DNSNameServer_mt = {
    __index = DNSNameServer,
}

function DNSNameServer.init(self, serveraddress)
    local socket, err = NativeSocket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );

    if not socket then
        return nil, err
    end

    local obj = {
        Socket = socket,
        ServerAddress = serveraddress,
    }
    setmetatable(obj, DNSNameServer_mt)

    return obj;
end

function DNSNameServer.create(self, servername)
    local remoteAddr = sockaddr_in(IPPORT_DNS, AF_INET);
    remoteAddr.sin_addr.S_addr = ws2_32.inet_addr( servername );

    return self:init(remoteAddr)
end

-- Construct DNS_TYPE_A request, send it to the specified DNS server, wait for the reply.
function DNSNameServer.Query(self, strToQuery, wType, msTimeout)
    wType = wType or ffi.C.DNS_TYPE_A
    msTimeout = msTimeout or 60 * 1000  -- 1 minute


    -- Prepare the DNS request
    local dwBuffSize = ffi.new("DWORD[1]", 2048);
    local buff = ffi.new("uint8_t[2048]")

    local wID = clock:GetCurrentTicks() % 65536;
        
    local res = windns_ffi.DnsWriteQuestionToBuffer_UTF8( ffi.cast("DNS_MESSAGE_BUFFER*",buff), dwBuffSize, ffi.cast("char *",strToQuery), wType, wID, true )

    if res == 0 then
        return false, "DnsWriteQuestionToBuffer_UTF8 failed."
    end

    -- Send the request.
    local iRes, err = self.Socket:sendTo(self.ServerAddress, ffi.sizeof(self.ServerAddress), buff, dwBuffSize[0]);
    

    if (not iRes) then
        print("Error sending data: ", err)
        return false, err
    end

    -- Try to receive the results
    local RecvFromAddr = sockaddr_in();
    local RecvFromAddrSize = ffi.sizeof(RecvFromAddr);
    local cbReceived, err = self.Socket:receiveFrom(RecvFromAddr, RecvFromAddrSize, buff, 2048);

    if not cbReceived then
        print("Error Receiving Data: ", err)
        return false, err;
    end

    if( 0 == cbReceived ) then
        return false, "Nothing received"
    end

    -- Parse the DNS response received with DNS API
    local pDnsResponseBuff = ffi.cast("DNS_MESSAGE_BUFFER*", buff);
    windns_ffi.DNS_BYTE_FLIP_HEADER_COUNTS ( pDnsResponseBuff.MessageHead );

    if pDnsResponseBuff.MessageHead.Xid ~= wID then        
        return false, "wrong transaction ID"
    end

    local pRecord = ffi.new("DNS_RECORD *[1]",nil);

    iRes = windns_ffi.DnsExtractRecordsFromMessage_W( pDnsResponseBuff, cbReceived, pRecord );
    
    pRecord = pRecord[0];
    local pRecordA = ffi.cast("DNS_RECORD *", pRecord);
    
    local function closure()
        if pRecordA == nil then
            return nil;
        end

        if pRecordA.wType == wType then
            local retVal = pRecordA
            pRecordA = pRecordA.pNext

            return retVal;
        end

        -- Find the next record of the specified type
        repeat
            pRecordA = pRecordA.pNext;
        until pRecordA == nil or pRecordA.wType == wType
    
        if pRecordA ~= nil then
            local retVal = pRecordA
            pRecordA = pRecordA.pNext
            
            return retVal;
        end

        -- Free the resources
        if pRecord ~= nil then
            windns_ffi.DnsRecordListFree( pRecord, ffi.C.DnsFreeRecordList );
        end 

        return nil
    end

    return closure
end

function DNSNameServer.A(self, domainToQuery) return self:Query(domainToQuery, ffi.C.DNS_TYPE_A) end
function DNSNameServer.MX(self, domainToQuery) return self:Query(domainToQuery, ffi.C.DNS_TYPE_MX) end
function DNSNameServer.CNAME(self, domainToQuery) return self:Query(domainToQuery, ffi.C.DNS_TYPE_CNAME) end
function DNSNameServer.SRV(self, domainToQuery) return self:Query(domainToQuery, ffi.C.DNS_TYPE_SRV) end

return DNSNameServer

Notice at the end there are some convenience functions for a few of the well known DNS record types. The ‘Query()’ function is generic, and will return records of any type. These convenience functions just make it easier.

And how to use it?

local ffi = require("ffi")
local DNSNameServer = require("DNSNameServer")
local core_string = require("core_string_l1_1_0")


--local serveraddress = "10.211.55.1"		-- xfinity
local serveraddress = "209.244.0.3" -- level 3

local domains = {
	"www.nanotechstyles.com",
	"www.adafruit.com",
	"adafruit.com",
	"adamation.com",
	"www.adamation.com",
	"microsoft.com",
	"google.com",
	"ibm.com",
	"oracle.com",
	"sparkfun.com",
	"apple.com",
	"netflix.com",
	"www.netflix.com",
	"www.us-west-2.netflix.com",
	"www.us-west-2.prodaa.netflix.com",
	"news.com",
	"hardkernel.org",
	"amazon.com",
	"walmart.com",
	"target.com",
	"godaddy.com",
	"luajit.org",
}



local function queryA()
	local function queryDomain(name)
		local dns = DNSNameServer(serveraddress) -- ms corporate
		print("==== DNS A ====> ", name)
		for record in dns:A(name) do
			local a = IN_ADDR();
    		a.S_addr = record.Data.A.IpAddress

    		print(string.format("name: %s\tIP: %s, TTL %d", name, a, record.dwTtl));
		end
	end

	for _, name in ipairs(domains) do 
		spawn(queryDomain, name)
		--queryDomain(name)
	end
end

local function queryCNAME()
	local dns = DNSNameServer(serveraddress) -- ms corporate
	local function queryDomain(name)
		print("==== DNS CNAME ====> ", name)
		for record in dns:CNAME(name) do
			print(core_string.toAnsi(record.pName), core_string.toAnsi(record.Data.CNAME.pNameHost))
		end
	end

	for _, name in ipairs(domains) do 
		queryDomain(name)
	end
end

local function queryMX()
	local function queryDomain(name)
		local dns = DNSNameServer(serveraddress) -- ms corporate
		print("==== DNS MX ====> ", name)
		for record in dns:MX(name) do
			print(core_string.toAnsi(record.pName), core_string.toAnsi(record.Data["MX"].pNameExchange))
		end
	end

	for _, name in ipairs(domains) do 
		spawn(queryDomain, name)
	end
end

local function querySRV()
	local dns = DNSNameServer(serveraddress) -- ms corporate
	for _, name in ipairs(domains) do 
		print("==== DNS SRV ====> ", name)
		for record in dns:SRV(name) do
			print(core_string.toAnsi(record.pName), core_string.toAnsi(record.Data.SRV.pNameTarget))
		end
	end
end

local function main()
  queryA();
  --queryCNAME();
  --queryMX();
  --querySRV();
end

run(main)

The function queryA() will query for the ‘A’ records, and print them out. Notice that it has knowledge of the giant union structure that contains the results, and it pulls out the specific information for ‘A’ records. It will create a new instance of the DNSNameServer for each query. That’s not as bad as it might seem. All it amounts to is creating a new UDP socket for each query. Since each query is spawned into its own task, they are all free to run and complete independently, which was the goal of this little exercise.

In the case of the CNAME query, there is only a single socket, and it is used repeatedly, serially, for each query.

The difference between the two styles is noticeable. For the serial case, the queries might ‘take a while’, because you have to wait for each result to come back before issuing the next query. In the cooperative case, you issue several queries in parallel, so the total time will only be as long as the longest query.
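The arithmetic behind that claim can be sketched outside of TINN too. For example, in C++ (illustrative only, with sleeps standing in for network round-trips), firing the queries concurrently makes the total elapsed time track the slowest query rather than the sum:

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

using namespace std::chrono;

// Stand-in for a DNS round-trip: just sleep for the given time.
static int fakeQuery(int ms) {
    std::this_thread::sleep_for(milliseconds(ms));
    return ms;
}

// Run all the queries concurrently; total elapsed time tracks the
// slowest one, not the sum of all of them.
long long concurrentTotalMs(const std::vector<int> &delays) {
    auto t0 = steady_clock::now();
    std::vector<std::future<int>> futs;
    for (int d : delays)
        futs.push_back(std::async(std::launch::async, fakeQuery, d));
    for (auto &f : futs)
        f.get();
    return duration_cast<milliseconds>(steady_clock::now() - t0).count();
}
```

With delays of 100, 200, and 300 ms, the serial version would take about 600 ms, while the concurrent one finishes in roughly 300 ms. TINN gets the same shape of result with cooperative tasks on one thread instead of OS threads.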

That’s a good outcome.

I like this style of programming. You go as low as you can to root out where the system might otherwise block, and you make that part cooperative. That way everything else above it is automatically cooperative. I also like the fact that it feels like I’m getting some parallelism, but I’m not using any of the typical primitives of parallelism, including actual threads, mutexes, and the like.

Well, that’s a hefty bit of code, and it serves the purpose I set out, so I’m a happy camper. Now, if I could just turn those unions into tables automatically…


Multitasking single threaded UI – Gaming meets networking

There are two worlds that I want to collide. There is the active UI gaming world, then there’s the more passive networking server world. The difference between these two worlds is the kind of control loop and scheduler that they both typically use.

The scheduler for the networking server is roughly:

while true do
  waitFor(networkingEvent)
  
  doNetworkingStuff()
end

This is great, and works quite well to limit the amount of resources required to run the server because most of the time it’s sitting around idle. This allows you to serve up web pages with a much smaller machine than if you were running flat out, with say a “polling” api.

The game loop is a bit different. On Windows it looks something like this:

while true do
  ffi.fill(msg, ffi.sizeof("MSG"))
  local peeked = User32.PeekMessageA(msg, nil, 0, 0, User32.PM_REMOVE);
			
  if peeked ~= 0 then
    -- do regular Windows message processing
    local res = User32.TranslateMessage(msg)		
    User32.DispatchMessageA(msg)
  end

  doOtherStuffLikeRunningGamePhysics();
end

So, when I want to run a networking game, where I take some input from the internet, and incorporate that into the gameplay, I’ve got a basic problem. Who’s on top? Which loop is going to be THE loop?

The answer lies in the fact that the TINN scheduler is a basic infinite loop, and you can modify what occurs within that loop. Last time around, I showed how the waitFor() function can be used to insert a ‘predicate’ into the scheduler’s primary loop. So, perhaps I can recast the gaming loop as a predicate and insert it into the scheduler?

I have a ‘GameWindow’ class that takes care of creating a window, showing it on the screen, and dealing with drawing. This window has a run() function which has the typical gaming infinite loop. I have modified this code to recast things in such a way that I can use the waitFor() as the primary looping mechanism. The modifications make it look like this:

local appFinish = function(win)

  win.IsRunning = true
  local msg = ffi.new("MSG")

  local closure = function()
    ffi.fill(msg, ffi.sizeof("MSG"))
    local peeked = User32.PeekMessageA(msg, nil, 0, 0, User32.PM_REMOVE);
			
    if peeked ~= 0 then
      local res = User32.TranslateMessage(msg)
      User32.DispatchMessageA(msg)

      if msg.message == User32.WM_QUIT then
        return win:OnQuit()
      end
    end

    if not win.IsRunning then
      return true;
    end
  end

  return closure;
end

local runWindow = function(self)
	
  self:show()
  self:update()

  -- Start the FrameTimer
  local period = 1000/self.FrameRate;
  self.FrameTimer = Timer({Delay=period, Period=period, OnTime =self:handleFrameTick()})

  -- wait here until the application window is closed
  waitFor(appFinish(self))

  -- cancel the frame timer
  self.FrameTimer:cancel();
end

GameWindow.run = function(self)
  -- spawn the fiber that will wait
  -- for messages to finish
  Task:spawn(runWindow, self);

  -- set quanta to 0 so we don't waste time
  -- in i/o processing if there's nothing there
  Task:setMessageQuanta(0);
	
  Task:start()
end

Starting from the run() function. Only three things need to occur. First, spawn ‘runWindow’ in its own fiber. I want this routine to run in a fiber so that it is cooperative with other parts of the system that might be running.

Second, call ‘setMessageQuanta(0)’. This is a critical piece to get a ‘gaming loop’. This quanta is the amount of time the IO processing part of the scheduler will spend waiting for an IO event to occur. This time will be spent every time through the primary scheduler’s loop. If the value is set to 0, then effectively we have a nice runaway infinite loop for the scheduler. IO events will still be processed, but we won’t spend any time waiting for them to occur.

This has the effect of providing maximum CPU timeslice to various other waiting fibers. If this value is anything other than 0, let’s say ‘5’ for example, then the inner loop of the scheduler will slow to a crawl, providing no better than 60 iterations of the loop per second. Not enough time slice for a game. Setting it to 0 allows more like 3000 iterations of the loop per second, which gives more time to other fibers.

That’s the trick of this integration right there. Just set the messageQuanta to 0, and away you go. To finish this out, take a look at the runWindow() function. Here just a couple of things are set in place. First, a timer is created. This timer will ultimately end up calling a ‘tick()’ function that the user can specify.

The other thing of note is the use of waitFor(appFinish(self)). This fiber will block here until the “appFinish()” predicate returns true.

So, finally, the appFinish() predicate, what does that do?

Well, I’ll be darned if it isn’t the essence of the typical game window loop, at least the Windows message handling part of it. Remember that a predicate that is injected to the scheduler using “waitFor()” is executed every time through the scheduler’s loop, so the scheduler’s loop is essentially the same as the outer loop of a typical game.
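Stripped of the Windows specifics, the pattern is simply a loop that evaluates registered predicates on every pass and retires the ones that return true. A toy sketch of that shape (not TINN’s actual scheduler, just the idea):

```cpp
#include <functional>
#include <vector>

// A toy cooperative scheduler: each pass through the loop runs every
// registered predicate; a predicate that returns true is retired.
// 'waitFor' blocking until the app closes amounts to exactly this.
class PredicateLoop {
public:
    void waitFor(std::function<bool()> pred) {
        preds.push_back(std::move(pred));
    }

    // Run until every predicate has been satisfied; returns the
    // number of passes the loop made.
    int run() {
        int iterations = 0;
        while (!preds.empty()) {
            ++iterations;
            for (size_t i = 0; i < preds.size(); ) {
                if (preds[i]())
                    preds.erase(preds.begin() + i);
                else
                    ++i;
            }
        }
        return iterations;
    }

private:
    std::vector<std::function<bool()>> preds;
};
```

In TINN the real loop also does IO processing and fiber scheduling on each pass, which is why a predicate that peeks the Windows message queue slots in so naturally.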

With all this in place, you can finally do the following:


local GameWindow = require "GameWindow"
local StopWatch = require "StopWatch"

local sw = StopWatch();

-- The routine that gets called for any
-- mouse activity messages
function mouseinteraction(msg, wparam, lparam)
	print(string.format("Mouse: 0x%x", msg))
end

function keyboardinteraction(msg, wparam, lparam)
	print(string.format("Keyboard: 0x%x", msg))
end


function randomColor()
		local r = math.random(0,255)
		local g = math.random(0,255)
		local b = math.random(0,255)
		local color = RGB(r,g,b)

	return color
end

function randomline(win)
	local x1 = math.random() * win.Width
	local y1 = 40 + (math.random() * (win.Height - 40))
	local x2 = math.random() * win.Width
	local y2 = 40 + (math.random() * (win.Height - 40))

	local color = randomColor()
	local ctxt = win.GDIContext;

	ctxt:SetDCPenColor(color)

	ctxt:MoveTo(x1, y1)
	ctxt:LineTo(x2, y2)
end

function randomrect(win)
	local width = math.random(2,40)
	local height = math.random(2,40)
	local x = math.random(0,win.Width-1-width)
	local y = math.random(0, win.Height-1-height)
	local right = x + width
	local bottom = y + height
	local brushColor = randomColor()

	local ctxt = win.GDIContext;

	ctxt:SetDCBrushColor(brushColor)
	--ctxt:RoundRect(x, y, right, bottom, 0, 0)
	ctxt:Rectangle(x, y, right, bottom)
end


function ontick(win, tickCount)

	for i=1,30 do
		randomrect(win)
		randomline(win)
	end

	local stats = string.format("Seconds: %f  Frame: %d  FPS: %f", sw:Seconds(), tickCount, tickCount/sw:Seconds())
	win.GDIContext:Text(stats)
end



local appwin = GameWindow({
		Title = "Game Window",
		KeyboardInteractor = keyboardinteraction,
		MouseInteractor = mouseinteraction,
		FrameRate = 24,
		OnTickDelegate = ontick,
		Extent = {1024,768},
		})


appwin:run()

This will put a window on the screen, and draw some lines and rectangles at a frame rate of roughly 24 frames per second.

And thus, the two worlds have been melded together. I’m not doing any networking in this particular case, but adding it is no different than doing it for any other networking application. The same scheduler is still at play, and everything else will work as expected.

The cool thing about this is the integration of these two worlds does not require the introduction of multiple threads of execution. Everything is still within the same single threaded context that TINN normally runs with. If real threads are required, they can easily be added and communicated with using the computicles that are within TINN.


My Head In The Cloud – putting my code where my keyboard is

I have written a lot about the greatness of LuaJIT, coding for the internet, async programming, and the wonders of Windows. Now, I have finally reached a point where it’s time to put the code to the test.

I am running a service in Azure: http://nanotechstyles.cloudapp.net

This site is totally fueled by the work I have done with TINN. It is a static web page server with a couple of twists.

First of all, you can access the site through a pretty name:
http://www.nanotechstyles.com

If you just hit the site directly, you will get the static front page which has nothing more than an “about” link on it.

If you want to load up a threed model viewing thing, hit this:
http://www.nanotechstyles.com/threed.html

If you want to see what your browser is actually sending to the server, then hit this:
http://www.nanotechstyles.com/echo

I find the echo thing to be interesting, and I try hitting the site using different browsers to see what they produce.  This kind of feedback makes it relatively easy to do rapid turnarounds on the webpage content, challenging my assumptions and filling in the blanks.

The code for this web server is not very complex.  It’s the same standard ‘main’ that I’ve used in the past:

local resourceMap = require("ResourceMap");
local ResourceMapper = require("ResourceMapper");
local HttpServer = require("HttpServer")

local port = arg[1] or 8080

local Mapper = ResourceMapper(resourceMap);

local obj = {}

local OnRequest = function(param, request, response)
	local handler, err = Mapper:getHandler(request)


	-- recycle the socket, unless the handler explicitly says
	-- it will do it, by returning 'true'
	if handler then
		if not handler(request, response) then
			param.Server:HandleRequestFinished(request);
		end
	else
		print("NO HANDLER: ", request.Url.path);
		-- send back content not found
		response:writeHead(404);
		response:writeEnd();

		-- recycle the request in case the socket
		-- is still open
		param.Server:HandleRequestFinished(request);
	end

end

obj.Server = HttpServer(port, OnRequest, obj);
obj.Server:run()

In this case, I’m dealing with the OnRequest() directly, rather than using the WebApp object.  I’m doing this because I want to do some more interactions at this level that the standard WebApp may not support.

Of course the ‘handlers’ are where all the fun is. I guess it makes sense to host the content of the site up on the site for all to see and poke fun at.
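For reference, the resource map the Mapper consumes can be as simple as a table keyed on URL path. This is a hypothetical sketch; the actual ResourceMap layout in TINN, and the handler names, may differ:

```lua
-- Hypothetical resource map: URL paths mapped to handler functions.
-- The handler names here are illustrative, not the actual code.
local resourceMap = {
	["/"]            = handleFrontPage,
	["/threed.html"] = handleStaticFile,
	["/echo"]        = handleEcho,
}

return resourceMap
```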

My little experiment here is to give my code real world exposure, with the intention of hardening it, and gaining practical experience on what a typical web server is likely to see out in the wild.

So, if you read this blog, go hit those links. Soon enough, perhaps I will be able to serve up my own blog using my own software. That’s got a certain circular reference to it.


Optimize At the Right Time and Level

It takes a long time to do the simplest things correctly. You want to receive some bytes on a socket?
int recv(SOCKET s,char *buf,int len,int flags);

No problem, right? How about that socket? How did it get set up? Is it tuned for low latency, big buffers, small buffers, what? And what about that buffer? And which flags are actually the best ones to use? The first question, I guess, is: ‘for what purpose?’

For the longest time, I have known that my IP sockets were not the most performant. Through rounds of trying to get low level networking, all the way up through http handling working correctly, I deliberately made some things extremely brain dead simple.

One of those things was my readLine() function. This is the workhorse of http processing. Previously, I would literally read a single character at a time, looking for those line terminators. Which meant making that recv() call a bazillion times during normal http processing.
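The old byte-at-a-time approach looked roughly like this (a reconstruction, not the original code, and it assumes the same Socket:receive() method and CR/LF constants used in the code below): every single character cost a trip to the socket.

```lua
-- Reconstruction of the naive approach: one receive() call per
-- character.  Every byte of every header line was a separate
-- round trip to the socket layer.
function IOCPNetStream:readOneLineNaive(buffer, size)
	local onechar = ffi.new("uint8_t[1]")
	local nchars = 0

	while nchars < size do
		local bytesread, err = self.Socket:receive(onechar, 1)
		if not bytesread or bytesread == 0 then break end

		if onechar[0] == LF then
			break
		elseif onechar[0] ~= CR then
			buffer[nchars] = onechar[0]
			nchars = nchars + 1
		end
	end

	return nchars
end
```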

Well, it worked great, reduced the surface area enough to help flesh out various bugs, and allowed me to finally implement the IO Completion Port based networking stack that I now have. It works well enough. But, it’s time to optimize. I am now about to add various types of streams, such as compressed, encrypted, and the like. For that I need to actually buffer up chunks at a time, not read single characters at a time.

So, I finally optimized this code.

function IOCPNetStream:refillReadingBuffer()
	print("IOCPNetStream:RefillReadingBuffer(): ",self.ReadingBuffer:BytesReadyToBeRead());

	-- if the buffer already has data in it
	-- then just return the number of bytes that
	-- are currently ready to be read.
	if self.ReadingBuffer:BytesReadyToBeRead() > 0 then
		return self.ReadingBuffer:BytesReadyToBeRead();
	end

	-- If there are currently no bytes ready, then we need
	-- to refill the buffer from the source
	local err
	local bytesread

	bytesread, err = self.Socket:receive(self.ReadingBuffer.Buffer, self.ReadingBuffer.Length)

	print("-- LOADED BYTES: ", bytesread, err);

	if not bytesread then
		return false, err;
	end

	if bytesread == 0 then
		return false, "eof"
	end

	self.ReadingBuffer:Reset()
	self.ReadingBuffer.BytesWritten = bytesread

	return bytesread
end

Basically, whenever some bytes are needed within the stream, the refillReadingBuffer() function is called. If there are still bytes sitting in the buffer, then those bytes are used. If it’s empty, then new bytes are read in. The buffer can be whatever size you want, so if your app is better off reading 1500 bytes at a time, then make it so. If you find performance is best at 4096 bytes per read, then set it up that way.

Now that pre-fetching is occurring, all operations related to reading bytes from the source MUST go through this buffer, or bytes will be missed. So, the readLine() function looks like this now:

function IOCPNetStream:readByte()
	-- First see if we can get a byte out of the
	-- Reading buffer
	local abyte,err = self.ReadingBuffer:ReadByte()

	if abyte then
		return abyte
	end

	-- If we did not get a byte out of the reading buffer
	-- try refilling the buffer, and try again
	local bytesread, err = self:refillReadingBuffer()

	if bytesread then
		abyte, err = self.ReadingBuffer:ReadByte()
		return abyte, err
	end

	-- If there was an error
	-- then return that error immediately
	print("-- IOCPNetStream:readByte, ERROR: ", err)
	return false, err
end

function IOCPNetStream:readOneLine(buffer, size)
	local nchars = 0;
	local byteread
	local err

	while nchars < size do
		byteread, err = self:readByte();

		if not byteread then
			-- err is either "eof" or some other socket error
			break
		else
			if byteread == 0 then
				break
			elseif byteread == LF then
				--io.write("LF]\n")
				break
			elseif byteread == CR then
				-- swallow it and continue on
				--io.write("[CR")
			else
				-- Just a regular character, so save it
				buffer[nchars] = byteread
				nchars = nchars+1
			end
		end
	end

	if err and err ~= "eof" then
		return nil, err
	end

	return nchars
end

The readOneLine() is still reading a byte at a time, but now, instead of going to the underlying socket directly, it calls the stream’s version of readByte(), which in turn reads a single byte from the pre-fetched buffer. That’s a lot faster than making the recv() system call, and when the buffer runs out of bytes, it will be automatically refilled, until there is a socket error.

Well, that’s just great. The performance boost is one thing, but there is another benefit. Now that I am reading in controllable chunks at a time, I could up the size to say 16K for the first chunk read for an http request. That would likely get all the data for whatever headers there are included in the buffer up front. From there I could do simple pointer offset/size objects, instead of creating actual strings for every line (creates copies). That would be a great optimization.
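That pointer offset/size idea can be sketched as a tiny ‘span’ over the reading buffer. This is hypothetical, not part of the current code: just a pointer plus a length, with a real Lua string only materialized when one is actually needed.

```lua
local ffi = require("ffi")

-- Hypothetical zero-copy 'span': a pointer + length into the
-- pre-fetched buffer, instead of an interned Lua string per line.
ffi.cdef[[
typedef struct {
	const uint8_t *	data;
	int32_t			size;
} span_t;
]]

local span_t = ffi.typeof("span_t")

-- A span over bytes [offset, offset+len) of a buffer pointer
local function spanOf(buffptr, offset, len)
	return span_t(buffptr + offset, len)
end

-- Only pay for a string copy when one is actually required
local function spanToString(s)
	return ffi.string(s.data, s.size)
end
```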

I can wait for further optimizations. Getting the pre-fetch working, and continuing to work with IOCP is a more important optimization at this time.

And so it goes. Don’t optimize too early until you really understand the performance characteristics of your system. Then, only belatedly, do what minimal amount you can to achieve the specific goal you are trying to achieve before moving on.


Exposing Yourself to the InterWebs – The Service Machine

Last time around, I eagerly put myself out onto the internet without a care in the world, using nothing more than my standard home desktop machine and the features of my home router. I was probed almost immediately, but it worked.

Next steps, I want to have a machine which is dedicated to the job of dealing with stuff to be accessible from the internet. I purchased one of those “barebones” PC systems from Shuttle. I configured it like this:

SZ77R5
32GB RAM
Intel i7-3770K processor
ASUS DVD-ROM
Seagate 2TB drive
Crucial 256GB mSATA SSD
Samsung 256GB SSD

I did not purchase an additional video card because although the specs are fairly beefy, this is not meant to be a gaming rig.

The first time I built a PC with my daughter, about 10 years ago I think, it was a pair of Shuttle PCs. They’re fairly straightforward to build; just purchase the right components.

This build was fairly straightforward, except for one little piece. When first bringing up the system, I had to connect my standard ViewSonic monitor, which has VGA/DVI/HDMI connectors on it. At first, I tried the HDMI connector, because that typically works. Well, it was a real head scratcher, as I never saw anything on the monitor, even though it sounded like it was ‘posting’ to the BIOS. Scratch scratch, go to bed. Next morning, I thought I’d swap the cable over to the DVI connector instead, and voila! That worked.

That’s an interesting bit of news right there. I was having a similar problem with input selection using my Odroid XU trying to startup Ubuntu. Same solution, try to use a monitor that has dedicated HDMI, or use the DVI connector if you can. Once the OS is installed, it will likely figure it all out correctly, but until then, no such luck.

My intention is to install Windows Server 2012. That’s so I can run Hyper-V and have some options as to OS installs and the like. More than likely, I’ll run a server OS as one of the VMs, and a client OS just for kicks. I’d like to have a Windows 7 client available, as well as Windows 8. Of course, this is breaking with my intention to have specialized machines, but given the excessive amount of compute and storage available, I figure I can have this little bit of fun.

I haven’t installed 2012 as yet, so I thought I’d try out one of those Linux Distros on a stick. I downloaded an instance of Tiny Core Linux, and booted the thing up. No problems. Then I created a USB keychain of the same, and again, no problem. It strikes me that this is a fairly easy way to test whether a machine is fundamentally working or not. I also like the fact that you can just pick up all your goods and move from machine to machine. Reminds me of the Raspberry Pi, and the Odroid, where the whole OS and your storage can be on an SD card, making it easy to move around.

In the coming week, I’ll setup the Windows Server OS images, and start putting content on the machine. Along with a web server, I want to host my various 3D models, and even a blog thing if I can.

It would also be nice to try out putting the machine in ‘the dmz’ and see how much of a gateway I can really make out of the machine. But, that’s a longer term thing.

For now, it’s just fun to have a good solid modern machine that is much quieter than my previous desktop machine, takes up less space, uses less power, and has a lot more capabilities.