Serialization 102 – CodeGen

Emboldened by the possibilities offered by simple BitRead/BitWrite and a streaming interface, I’ve taken on the next step of the serialization process. Being the lazy, error-prone coder that I am, I’m looking for a way to have the machine do as much of the coding as possible. The machine is pretty dumb though. You have to describe things to it very explicitly before it will give you any help.

There are many ways to describe a data structure. One of the earlier stated design goals of Lua was for it to be a data description language. Tada! Since my primary goal is to stay within the confines of Lua as much as possible, it makes sense to use Lua itself to describe data structures.

My simplest example:

moveto = {
    name = "MoveTo";
    fields = {
        {name="x", basetype="int32_t", ordinal=1},
        {name="y", basetype="int32_t", ordinal=2},
    };
};

This is intended to represent a command in my drawing system. Since I’m in Lua, I could just use this structure as it is, but it would be kind of clunky. This type of structure is what you might see with XML Schema and the like. Very verbose, lots of curly braces, very complete, but very unfriendly to the programmer.

But, my data structures don’t change much, so I don’t mind writing it in this form initially, because once in this form, I can easily transform it into other forms. If I want it to show up as a C structure, I have a function call for that:

cstruct = CStructFromTypeInfo(moveto);

print(cstruct);

typedef struct MoveTo {
	int32_t x;
	int32_t y;
} MoveTo;
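
The generator behind that call can be tiny. Here is a minimal sketch, assuming only plain scalar fields and the type-info layout shown above. The function body is an illustration, not necessarily what the real CStructFromTypeInfo does:

```lua
-- Sketch only: walk the field list in order and emit one C declaration per field.
-- Bitfields and arrays are deliberately ignored here.
local function CStructFromTypeInfo(info)
    local lines = { "typedef struct " .. info.name .. " {" }
    for _, field in ipairs(info.fields) do
        lines[#lines + 1] = "\t" .. field.basetype .. " " .. field.name .. ";"
    end
    lines[#lines + 1] = "} " .. info.name .. ";"
    return table.concat(lines, "\n")
end

local moveto = {
    name = "MoveTo";
    fields = {
        {name="x", basetype="int32_t", ordinal=1},
        {name="y", basetype="int32_t", ordinal=2},
    };
};

print(CStructFromTypeInfo(moveto))
```

Building a list of lines and joining them once with table.concat avoids repeated string concatenation, which matters little here but scales better for large structures.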

Well, that’s convenient. If I wrap that in an ffi.cdef[[]], I will have a structure that I can pass back and forth between the two worlds. If I do:

MoveTo = ffi.typeof("MoveTo");

Then I’ll have a data structure that I can easily access from the Lua side as well. Then my code can easily look like:

myStruct = MoveTo(10,20);
graphics:Deliver(myStruct);

Or what have you. But, this is about serialization isn’t it? Well, serialization starts with a data definition. Depending on what system you’re on, that data is readily available, or it needs to be augmented in some way. In Lua, there really isn’t much of a data type system. There is only a single number type, plus string, nil, boolean, and table, and that’s about it as far as data goes. So, you lose out on the other data types, and bitfields, and enums, date… If you were using C++, there’s the Runtime Type Information (RTTI), and that’s a big hairy beast that bloats your code, but gives you some type information at runtime. Similar for Objective-C. C# has a great type system, where you can get all information for any field, fairly easily.

Given enough type information, you should be able to serialize an object/structure, at least at a superficial level. So, what can I do with what I’ve defined here so far? Well, I don’t want to write the serializer, even though it’s only a couple lines of code, so I do the following:

local ser = CTypeSerializer(EMRTypes[EMR_MOVETOEX])
print(ser);

function write_MoveTo_ToStream(stream, value)
	stream:WriteInt32(value.x);
	stream:WriteInt32(value.y);
end
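
For the curious, a generator like this might be sketched as follows. Only WriteInt32 appears in the output above; the other method names in the lookup table are my assumption about what a stream would offer, and the body is a sketch rather than the actual CTypeSerializer:

```lua
-- Assumed mapping from C base types to stream writer methods.
-- Only WriteInt32 is confirmed by the generated output; the rest are hypothetical.
local writerNames = {
    int8_t  = "WriteInt8",  uint8_t  = "WriteUInt8",
    int16_t = "WriteInt16", uint16_t = "WriteUInt16",
    int32_t = "WriteInt32", uint32_t = "WriteUInt32",
}

-- Emit Lua source text for a serializer of the described type.
local function CTypeSerializer(info)
    local lines = { string.format("function write_%s_ToStream(stream, value)", info.name) }
    for _, field in ipairs(info.fields) do
        local method = assert(writerNames[field.basetype], "no writer for " .. field.basetype)
        lines[#lines + 1] = string.format("\tstream:%s(value.%s);", method, field.name)
    end
    lines[#lines + 1] = "end"
    return table.concat(lines, "\n")
end

local moveto = {
    name = "MoveTo";
    fields = {
        {name="x", basetype="int32_t", ordinal=1},
        {name="y", basetype="int32_t", ordinal=2},
    };
};

-- Compile the generated text; loadstring exists in Lua 5.1/LuaJIT, load handles strings in 5.2+.
local compile = loadstring or load
local chunk = assert(compile(CTypeSerializer(moveto)))
chunk() -- running the chunk defines the global write_MoveTo_ToStream

-- A stand-in stream that just records what was written.
local written = {}
local stream = { WriteInt32 = function(self, n) written[#written + 1] = n end }
write_MoveTo_ToStream(stream, { x = 10, y = 20 })
print(written[1], written[2])
```

Generating source text and compiling it means the same generator could just as easily emit C or any other language by swapping the format strings.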

And similar for the deserializer. Of course, this is just a raw function to hydrate a structure from a stream. It is up to a ‘protocol’ to determine when this particular object is to be deserialized, vs any other. Typically, the protocol will place a marker, perhaps the name of the type, in the stream to be read out later.
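
To make the marker idea concrete, here is a small sketch of a name-tagged protocol. Everything in it is hypothetical: the toy stream stores Lua values in a table instead of packing bytes, and WriteTagged/ReadTagged are names I made up for illustration:

```lua
-- Toy stream: keeps values in a table rather than bytes (an assumption for brevity).
local function NewToyStream()
    local items, readpos = {}, 1
    local function put(self, value) items[#items + 1] = value end
    local function get(self)
        local value = items[readpos]
        readpos = readpos + 1
        return value
    end
    return { WriteString = put, WriteInt32 = put, ReadString = get, ReadInt32 = get }
end

-- One writer and one reader per type, keyed by the type name used as the marker.
local writers = {
    MoveTo = function(stream, v)
        stream:WriteInt32(v.x)
        stream:WriteInt32(v.y)
    end,
}

local readers = {
    MoveTo = function(stream)
        local x = stream:ReadInt32()
        local y = stream:ReadInt32()
        return { x = x, y = y }
    end,
}

-- The 'protocol': write the type name first, then the fields.
local function WriteTagged(stream, typename, value)
    stream:WriteString(typename)
    writers[typename](stream, value)
end

-- Read the marker, then dispatch to the matching reader.
local function ReadTagged(stream)
    local typename = stream:ReadString()
    local reader = readers[typename]
    if not reader then
        return nil, "unknown type marker: " .. tostring(typename)
    end
    return typename, reader(stream)
end

local stream = NewToyStream()
WriteTagged(stream, "MoveTo", { x = 10, y = 20 })
local typename, value = ReadTagged(stream)
print(typename, value.x, value.y)
```

A string marker is easy to debug but costs bytes on every message; a numeric type id would be the obvious tighter alternative.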

This is great. I can even describe fairly complex structures with things like bitfields. Here are a couple from the networking world:

IPv4Header_Info = {
    name = "IPv4Header";
    fields = {
    {name = "version", basetype = "uint8_t", subtype="bit", repeating = 4, ordinal =1};
    {name = "headerlength", basetype = "uint8_t", subtype="bit", repeating = 4, ordinal =2};
    {name = "typeofservice", basetype = "uint8_t", repeating = 1, ordinal =3};
    {name = "totallength", basetype = "uint16_t", repeating = 1, ordinal =4};
    {name = "identification", basetype = "uint16_t", repeating = 1, ordinal =5};
    {name = "blank", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =6};
    {name = "DF", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =7};
    {name = "MF", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =8};
    {name = "fragmentoffset", basetype = "uint16_t", subtype="bit", repeating = 13, ordinal =9};
    {name = "ttl", basetype = "uint8_t", repeating = 1, ordinal =10};
    {name = "protocol", basetype = "uint8_t", repeating = 1, ordinal =11};
    {name = "headerchecksum", basetype = "uint16_t", repeating = 1, ordinal =12};
    {name = "source", basetype = "uint32_t", repeating = 1, ordinal =13};
    {name = "destination", basetype = "uint32_t", repeating = 1, ordinal =14};
    };
};

IPv6Header_Info = {
    name = "IPv6Header";
    fields = {
    {name = "version", basetype = "uint32_t", subtype="bit", repeating = 4};
    {name = "priority", basetype = "uint32_t", subtype="bit", repeating = 4};
    {name = "flowlabel", basetype = "uint32_t", subtype="bit", repeating = 24};
    {name = "payloadlength", basetype = "uint16_t"};
    {name = "nextheader", basetype = "uint8_t"};
    {name = "hoplimit", basetype = "uint8_t"};
    {name = "source", basetype = "uint8_t", repeating = 16};
    {name = "destination", basetype = "uint8_t", repeating = 16};
    };
};

In the case of the IPv4Header, I use all the bells and whistles (except enum). There are different types, bitfields, repeat counts, and ordinals. I left out the “required” attribute, because the default is ‘true’. In the IPv6 case, I even left out the ordinals, as the generator assumes the fields are in the order specified (so it must iterate using ipairs, which walks the list in index order). Depending on what you’re doing, even the field name might not be necessary. For example, if you’re implementing a stream processor, you just need to know the type, not the name.
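
Those defaults could be applied in one normalization pass before any code is generated. This is a sketch under my own assumptions: NormalizeFields is a hypothetical helper, and defaulting "repeating" to 1 is inferred from the IPv6 fields that omit it:

```lua
-- Hypothetical helper: returns a new field list with all defaults filled in,
-- sorted by ordinal. Mixing explicit and implicit ordinals in one list could
-- produce collisions, so a description should use one style or the other.
local function NormalizeFields(info)
    local fields = {}
    for index, field in ipairs(info.fields) do
        fields[#fields + 1] = {
            name      = field.name,
            basetype  = field.basetype,
            subtype   = field.subtype,
            repeating = field.repeating or 1,      -- assumed default
            ordinal   = field.ordinal or index,    -- position in the list when omitted
            required  = field.required ~= false,   -- default is true
        }
    end
    table.sort(fields, function(a, b) return a.ordinal < b.ordinal end)
    return fields
end

local header = {
    name = "IPv6Header";
    fields = {
        {name = "version", basetype = "uint32_t", subtype="bit", repeating = 4};
        {name = "payloadlength", basetype = "uint16_t"};
        {name = "source", basetype = "uint8_t", repeating = 16};
    };
};

for _, field in ipairs(NormalizeFields(header)) do
    print(field.ordinal, field.name, field.repeating, field.required)
end
```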

The following data structures are created from these type info lists.

typedef struct IPv4Header {
	uint8_t version : 4;
	uint8_t headerlength : 4;
	uint8_t typeofservice;
	uint16_t totallength;
	uint16_t identification;
	uint16_t blank : 1;
	uint16_t DF : 1;
	uint16_t MF : 1;
	uint16_t fragmentoffset : 13;
	uint8_t ttl;
	uint8_t protocol;
	uint16_t headerchecksum;
	uint32_t source;
	uint32_t destination;
} IPv4Header;

typedef struct IPv6Header {
	uint32_t version : 4;
	uint32_t priority : 4;
	uint32_t flowlabel : 24;
	uint16_t payloadlength;
	uint8_t nextheader;
	uint8_t hoplimit;
	uint8_t source[16];
	uint8_t destination[16];
} IPv6Header;
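
The only per-field decision in producing these declarations is whether a field is a bitfield, an array, or a plain scalar. A minimal sketch of that decision, where FieldDeclaration is a hypothetical name and "repeating" is read as a bit width when subtype is "bit" and as an element count otherwise:

```lua
-- Sketch: turn one field description into one C member declaration.
local function FieldDeclaration(field)
    local count = field.repeating or 1
    if field.subtype == "bit" then
        -- bitfield: 'repeating' is the width in bits
        return string.format("%s %s : %d;", field.basetype, field.name, count)
    elseif count > 1 then
        -- array: 'repeating' is the element count
        return string.format("%s %s[%d];", field.basetype, field.name, count)
    end
    -- plain scalar
    return string.format("%s %s;", field.basetype, field.name)
end

print(FieldDeclaration({name = "DF", basetype = "uint16_t", subtype = "bit", repeating = 1}))
print(FieldDeclaration({name = "source", basetype = "uint8_t", repeating = 16}))
print(FieldDeclaration({name = "ttl", basetype = "uint8_t", repeating = 1}))
```

Note that a one-bit field still has to be emitted as a bitfield, which is why the subtype check comes before the count check.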

And of course, their equivalent (de)serializers can be auto generated in the same way the first one was.

The amount of code needed to auto gen the structure, or the serializers is trivial. That means, it’s just as easy to do the same for any language you care to use. Many serialization schemes in use today use this mechanism. It’s quick and dirty, but it’s a bit bloated, and fragile when it comes to versioning things.

Google has a serialization technology called Protocol Buffers, their flavor for communicating between things. I found one project (it apparently came from within Google) that does streaming-based Protocol Buffer parsing, instead of filling up existing objects. The upb project follows the same methodology as the old SAX XML parsers. That is, you register callbacks, and when the parser sees something in the stream, it calls your callback, and you do whatever you want with the results. Thus, the protocol does not force you to use structures, if that’s not appropriate for your usage pattern.
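
The callback style is easy to mimic with the type-info tables from earlier. The sketch below is not upb’s API. ParseWithCallbacks, the handler names, and the toy reader are all made up to show the shape of the idea:

```lua
-- Assumed mapping from base type to a stream reader method (hypothetical).
local readerNames = { int32_t = "ReadInt32" }

-- Toy reader that hands back values from a list instead of decoding bytes.
local function NewToyReader(values)
    local pos = 0
    return {
        ReadInt32 = function(self)
            pos = pos + 1
            return values[pos]
        end,
    }
end

-- Walk the field list and report each value through optional callbacks.
local function ParseWithCallbacks(stream, info, handlers)
    if handlers.onStart then handlers.onStart(info.name) end
    for _, field in ipairs(info.fields) do
        local method = assert(readerNames[field.basetype], "no reader for " .. field.basetype)
        local value = stream[method](stream)
        if handlers.onField then handlers.onField(field.name, value) end
    end
    if handlers.onEnd then handlers.onEnd(info.name) end
end

local moveto = {
    name = "MoveTo";
    fields = {
        {name="x", basetype="int32_t", ordinal=1},
        {name="y", basetype="int32_t", ordinal=2},
    };
};

-- Usage 1: fill a structure.
local point = {}
ParseWithCallbacks(NewToyReader({10, 20}), moveto, {
    onField = function(name, value) point[name] = value end,
})

-- Usage 2: no structure at all, just react to each field as it goes by.
local seen = {}
ParseWithCallbacks(NewToyReader({10, 20}), moveto, {
    onField = function(name, value) seen[#seen + 1] = name .. "=" .. value end,
})
print(table.concat(seen, ", "))
```

The parser itself never decides where the data ends up, which is the whole point: the handler table is the only thing that changes between the two uses.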

In my case here, I could utilize something like a upb parser, and either fill my structures, or not.  The code is trivial to change, so it’s not a design choice I have to make up front.  I can simply do what’s natural for whatever the situation is.

With these mechanisms in hand, it starts to become interesting as to what can be done automatically. These are just raw structures though. Typically, when trying to deserialize something, like an image, things get more complicated. The data structure alone is not enough to tell you how to get stuff out of the stream. If there’s compression involved, for example, you need to know about that and do something. So, this technique is useful, and it’s a good down payment on more complex mechanisms, but it’s just the beginning.

This is good enough though for most typical communications between threads in a multi-threaded program. Now I can trivially create some “commands”, serialize them into memory packets, and send those packets between threads. A no-brainer!
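
As a rough sketch of what such a packet might look like, here is a MoveTo command packed into a 9-byte string: one byte of command id followed by two little-endian 32-bit integers. The command id, the byte layout, and the helper names are all my assumptions, and a plain table stands in for whatever queue actually carries packets between threads:

```lua
local CMD_MOVETO = 1 -- hypothetical command id

-- Pack a signed 32-bit integer as 4 little-endian bytes (pure Lua, no bit library).
local function PackInt32(n)
    if n < 0 then n = n + 4294967296 end -- two's complement
    local b0 = n % 256; n = math.floor(n / 256)
    local b1 = n % 256; n = math.floor(n / 256)
    local b2 = n % 256; n = math.floor(n / 256)
    local b3 = n % 256
    return string.char(b0, b1, b2, b3)
end

-- Read a signed 32-bit integer starting at the given 1-based offset.
local function UnpackInt32(packet, offset)
    local b0, b1, b2, b3 = string.byte(packet, offset, offset + 3)
    local n = b0 + b1 * 256 + b2 * 65536 + b3 * 16777216
    if n >= 2147483648 then n = n - 4294967296 end
    return n
end

local function PackMoveTo(cmd)
    return string.char(CMD_MOVETO) .. PackInt32(cmd.x) .. PackInt32(cmd.y)
end

local function UnpackCommand(packet)
    local id = string.byte(packet, 1)
    if id == CMD_MOVETO then
        return "MoveTo", { x = UnpackInt32(packet, 2), y = UnpackInt32(packet, 6) }
    end
    return nil, "unknown command id: " .. tostring(id)
end

-- A plain table used as a FIFO, standing in for a real inter-thread queue.
local queue = {}
queue[#queue + 1] = PackMoveTo({ x = 10, y = -20 })

local packet = table.remove(queue, 1)
local name, cmd = UnpackCommand(packet)
print(name, cmd.x, cmd.y, #packet)
```

Because the packet is just an immutable Lua string, it can be handed to another thread without sharing any live tables.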

