The Lazy Programmer looks at PE files
Posted: May 3, 2012 Filed under: Lua, LuaJIT, System Programming Leave a comment »I thought I’d try and apply my bit twiddling skills against an interesting problem. I have reason to look at PE files, or ELF, or Mach-O, and do some interesting stuff based on what I find there. My experience dealing with these executable file formats is fairly limited. I probably knew about the inner workings of Elf back in the day, but I’ve never had any real need to know the depts of PE files.
How hard could it be? It’s just a bunch of bits and bytes right?
The hardest thing for such a challenge is finding a good source of information on the subject. In the case of PE file format, I started by getting the Microsoft PE File Format Specification. You have to accept a license agreement before you can download it. That document is full of information. It’s a bit cryptic in places though. It doesn’t spell out in absolute detail what the DOS header up front looks like. It just says “yah, it’s there, and it should start with “MZ”.
Like all things modern, I turned to the internet to get the real skinny on the format in question. I found one really good tutorial on the PE format, presented by CodeBreakers Magazine. This one is really interesting because it goes into intricate details about what’s in the file, how its used, how viruses are created and stashed away. The only challenge it has is that it doesn’t reflect all forms of the PE file format. It was written in 2006, so it’s missing some information. Still, it makes for a good companion to the MS document because it actually explains everything.
As well as having these raw reference materials, I also found a couple of tools of use. First was HexEdit. This is one of those simple file viewers that allows you to see the file in Hex form, which is great when you’re trying to confirm what you’re reading. Another tool that is very useful is PEBrowse Professional. This tool parses the PE file, and shows you all the detailed information you could care to look at. It includes a disassembler, and can read TYPELIB information if it’s present. That’s great for dealing with COM interfaces.
The way I got started was to find define the meta data for important data structures that can be found in the file. This includes things like:
- IMAGE_DOS_HEADER
- COFF
- PE32Header
- PE32PlusHeader
- IMAGE_SECTION_HEADER
and the like. The stream of code that deals with the first few pieces of information looks like this:
local buff, bufflen = copyFileToMemory("HeadsUp.exe")
local offset = 0
local dosinfo = IMAGE_DOS_HEADER(buff, bufflen, offset)
printDOSInfo(dosinfo)
offset = offset + dosinfo.ClassSize
local ntheadertype = MAGIC4(buff, bufflen, dosinfo:get_e_lfanew())
offset = ntheadertype.Offset + ntheadertype.ClassSize
local fileHeader = COFF(buff, bufflen, offset)
printCOFF(fileHeader)
offset = offset + fileHeader.ClassSize
-- Read the 2 byte magic for the optional header
local pemagic = MAGIC2(buff, bufflen, offset)
local peheader=nil
if IsPe32Header(pemagic) then
peheader = PE32Header(buff, bufflen, offset)
elseif IsPe32PlusHeader(header) then
peheader = PE32PlusHeader(buff, bufflen, offset)
end
printPEHeader(peheader);
That’s a start. From here, the basics are in hand, and other bits and pieces from the file can be loaded in. Not bad from not knowing anything about PE files to being able to browse the basics. Of course, Windows gives you plenty of library routines which make this trivial work, but, perhaps I want to be able to perform this little dance without using Windows at all.
The real work comes when the parsing is more involved, including relative offsets to base addressed stuff, and so on and so forth.
I like the ease of programming like this though. After rough data structures are defined, I’m left with just sliding them into the right positions along a buffer, and reading what I find there.
Serialization 105 – The Lazy Programmer shares data
Posted: May 2, 2012 Filed under: Lua, LuaJIT, System Programming, Uncategorized Leave a comment »Serialization Series
- Serialization 104 – Pixel Packing Pugilism
- Serialization 103 – Getting the Fiddly Bits
- Serialization 102 – CodeGen
- Serialization 101
And finally, to conclude this little serialization excursion. I want to create an application that is a collaborative whiteboard. All the application needs to do is receive a series of drawing command from multiple participants, and execute those commands to render something on each of the screens of the participants. Ideally, code ends up doing something like the following:
local commands = {
MoveTo({x=10, y=10}),
LineTo({x=20, y=20}),
LineTo({x=30, y=30}),
LineTo({x=10, y=10}),
}
NetworkInterface:Send(commands)
And magic happens!
Well, that’s pretty much what can be done now that all these various bit slinging bits and pieces have been assembled. There were a couple of enhancements that needed to be made along the way though. Here is what that MoveTo command looks like:
EMRTypes[EMR_MOVETOEX] = {
name = "MoveTo";
fields = {
{name="id", basetype="int32_t", default= EMR_MOVETOEX},
{name="x", basetype="int32_t"},
{name="y", basetype="int32_t"},
};
}
I have a bit of code that will take this info and turn it into a little class that will deal with all sorts of things. One of the new members on the field specification is a default value. What I can achieve with this is to have a value set on the structure as soon as it’s constructed. So, by default, the id field will have the value of EMR_MOVETOEX, which is a number. So, by default, I could do this:
cmd = MoveTo() network:send(cmd.DataPtr, cmd.BufferSize)
And, what I’d get on the wire is, int32(EMR_MOVETOEX), int32(0), int32(0)
That’s great, because on the receiving end, I’ll first pick up on the ID field, which will tell me which data structure to pull out of the list of data structures, and then I can deserialize using that data structure.
cmd = network:receive()
Of course, the usual field accessors are available, because the autogen’d class has them, so:
cmd:get_x()
cmd:get_y()
These “commands” are just data. I did not send any behavior with them. What happens with the data is up to the recipient. The data could simply be logged, or actual drawing commands could be executed, or the data may be routed to other recipients after sniffing out a couple bits and pieces of information.
Sometimes, you want to set values from the get go though, so as listed above:
cmd = MoveTo({x=10, y=10})
This will allocate just enough space to hold the values for the class. At the same time, it will actually initialize the memory by calling the field setters with the specified values. That’s pretty neat I think.
You can also control the memory that is allocated on your own though:
size = MoveTo.ClassSize * 10 -- space for 10 structures
buff = ffi.new("char[?]",MoveTo.ClassSize * 10)
int offset = 0
cmds = {}
for i=1,10 do
table.insert(cmds, MoveTo(buff, offset, {x=i, y=i*2}))
end
sender:send(buff, size)
That would construct 10 move commands, all in the same preallocated buffer, and send it off somewhere. I think that’s pretty useful.
Another way of seeing how this might be useful is when you consider the typical network packeting challenge.
For something like UDP, you have the following setup:
MAC Header IP Header UDP Header Payload
That’s three headers and a payload. There are a couple of checksums in those headers as well, so it’s nice to have everything together for ease of calculation. Using these various serialization techniques, you could construct a single packet, and then just line up the various header structures appropriately, and set your values. That’s kind of useful because no copying has to occur along the way. No reallocation to prepend a header or any nonsense like that. If you’re doing something as silly as trying to use the IP protocol to communicate between threads on the same machine, this is fairly useful, and resource conservative.
There are other interesting constructs that can be built up here. There is for instance unions. That’s nothing more than throwing two or more layouts on top of the same chunk of memory. Well, that’s a no brainer here. Just define your two layouts, and then apply them to the same chunk of memory. This happens a lot when you’re dealing with something like IPv4 and IPv6. Which address type is it? Oh no, now I have to use that silly IN_STORAGE thing, or whatever.
There is another construct that I had to deal with, and that has to do with strings. Take the typical person data structure:
Person_Info = {
name = "Person";
fields = {
{name = "First", basetype = "char", subtype="string", repeating = 20};
{name = "Middle", basetype = "char", subtype="string", repeating = 20};
{name = "Last", basetype = "char", subtype="string", repeating = 20};
{name = "Age", basetype = "uint16_t"};
{name = "City", basetype = "char", subtype="string", repeating = 32};
{name = "State", basetype = "char", subtype="string", repeating = 32};
{name = "Zip", basetype = "char", subtype="string", repeating = 10};
};
};
There’s a couple of things of note here. First of all, these fields are of a fixed size in a buffer. The strings are not pointers to pieces of memory allocated somewhere else. It’s not tightly packed. This is like a real fixed size “record”. The basetype is “char”, which makes sense, but how do you really want to access this? Ideally, you’d want to do the following:
local p = Person();
p:set_First("William");
p:set_Middle("A");
p:set_Last("Adams");
and likewise, when you access the fields, you want Lua strings returned:
print("First: ", p:get_First())
print("Middle: ", p:get_Middle())
print("Last: ", p:get_Last())
Well, those “strings” are actually just bytes in a buffer. Normally, to convert to strings, you’d have to do something like:
ffi.string(charptr)
But, that’s a pain. So, the ‘subtype=”string”‘ is a bit of information that tells the serialization system that this is really a null terminated C style of string, and setting/getting should treat it as such. So, when you set the value, it will be copied into the buffer, as it should be, and a null terminator will be put in place. So, the size must include space for the null terminator. Similarly, when you go to get the value out again, a Lua string will be interned and you’ll get that back. You’re not pointing to anything within the buffer at that point, it’s not a pointer to the start of the characters.
This is a natural feel for the thing. It just makes sense. If you do actually want to get a pointer for a particular field, then just don’t define the subtype as ‘string’. Then, when you use the getter, you’ll get back a pointer and the size of the field:
ptr, size = p:get_Middle()
That way you’re free to do as you like, send it off somewhere, copy it, set the value, whatever. It’s just raw access within the underlying buffer at that point, with all the nastiness that entails.
Well, that’s enough for me to play with for now. It would be nice to add nested types, enums, and other things you typically find in type systems. Using this here though, I was able to reduce the amount of code in my TargaReader, and I’m now able to take a look at packets coming off the wire, down to the lowest level of protocol.
If you’ve ever done any customized packet filtering with WireShark, this might all seem very familiar. WireShark utilizes Lua to do its network funny business. The only challenge is WireShared does not provide a general libray for others to use for whatever needs they might have. They’ve created a set of routines that work very well within the context of the WireShark environment. I wanted to generalize that sort of thing so that I could apply the techniques to anything from media parsing to protocol sniffing, to collaborative apps.
And here it is. I am a lazy programmer, and this bit twiddling serializing stuff makes my life that much easier, and gives me that much more time to sit around and be lazy.
Serialization 104 – Pixel Packing Pugilism
Posted: May 1, 2012 Filed under: Lua, LuaJIT, System Programming 1 Comment »Do you ever wonder what’s really going on inside that computer? I mean, you layout some data structure, all nice and neat. You pack tightly using bit fields, thinking you’ve written the most tightly packed conservative bit of code known to man. And then you go and try to access the structure, to set or get a value.
Well, truth be told, there’s no such thing as a 2-bit value, so somehow, the machine/compiler coerces that into something it can deal with, either a int8_t, or int16_t, or int32_t, or what have you. Yah, it’s only a couple of instructions, and it all happens transparently, and it’s super scalar (or not), and runs in parallel without stalling, and how many angels can dance on the head of a transistor anyway…
Often times, when you’re dealing with a protocol, you get something in the mail that describes a ‘header’ like this:
The Rtp header has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC |M| PT | sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| synchronization source (SSRC) identifier |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers (if mixers used) |
| .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
V = Version
P = Padding
X = Extensions
CC = Count of Contributing Sources
M = Marker
PT = Payload Type
And it looks just like that, complete with ASCII art and everything. Take a look at some of the internet standards if you don’t believe me.
What this is telling me, is that in order to communicate using the Rtp protocol (which I really want to do), I will receive a packet off the networking wire that can be described by this header. It gets a bit dicey after CSRC, but essentially the header is all I’d care about up front. In particular, I’m interested in the sequence number, timestamp, and Payload Type.
Since I have trained myself to describe my data structures in a more abstract form, it looks like this:
RTPHeader_Info = {
name = "RTPHeader";
fields = {
{name = "Version", basetype = "uint8_t", subtype="bit", repeating = 2};
{name = "Padding", basetype = "uint8_t", subtype="bit", repeating = 1};
{name = "Extensions", basetype = "uint8_t", subtype="bit", repeating = 1};
{name = "ContributingCount", basetype = "uint8_t", subtype="bit", repeating=4};
{name = "Marker", basetype = "uint8_t", subtype="bit", repeating=1};
{name = "PayloadType", basetype = "uint8_t", subtype="bit", repeating=7};
{name = "SequenceNumber", basetype = "uint16_t"};
{name = "TimeStamp", basetype = "uint32_t"};
{name = "SSRC", basetype = "uint32_t"};
};
};
Armed with the previously developed CStructFromTypeInfo(info) function, I can easily create the following structure:
typedef struct RTPHeader {
uint8_t Version : 2;
uint8_t Padding : 1;
uint8_t Extensions : 1;
uint8_t ContributingCount : 4;
uint8_t Marker : 1;
uint8_t PayloadType : 7;
uint16_t SequenceNumber;
uint32_t TimeStamp;
uint32_t SSRC;
} RTPHeader;
I can even create easy serialize/deserializers which can go from a stream to this structure and back. But, wait a second. Should I?
The way networking traffic works, you are handed a chunk of data, and even if you are “streaming”, you get a certain amount of data in a fixed sized buffer. Depending on your application, you may choose to copy that data, or just look at values and take some actions based on what you see. If you’re doing something like playing audio, you’re probably going to copy the buffer into another part of the system that deals with playing audio. If you’re just doing some filtering, or some simple command where the buffer isn’t needed, you’ll take take what you want and let it return back to the system unharmed.
So, let’s assume you don’t really want to copy the buffer at all, but you need to get values out of it, or perhaps set some values in it before passing it along. Or, perhaps you’re at the head end, and you want to construct a header package, right there inside the buffer that some other part of the system has handed you. You don’t want to cons up a structure, just for the convenience of being able to use a function to write a value into a memory location, just so you can copy that whole structure into the buffer. And what about the endianness of it all!!
OK. So, clearly there must be another mechanism that can be employed to achieve the following flow:
- Create a buffer of a specific size
- Write values into the buffer at specific bit locations
- Send the buffer into the networking stack without any copying
Well, from the previous expose on bit banging, two functions were created:
setbitstobytes(bytes, startbit, bitcount, value, bigendian) getbitsfrombytes(bytes, startbit, bitcount)
With these two functions, that’s all you need to be able to set a value at any location anywhere in a buffer. That is, any value up to an int32_t. Going beyond that is not impossible as you can always just break it up.
At any rate, this looks like another opportunity to do some codegen. So, given the same exact meta information that was used to create that typedef, and the serializers, I could create an offsets table. Basically, I need a table that contains the following information for each field:
name, offset, number of bits
Easy enough, and the BitOffsetsFromTypeInfo(desc) can do the trick. Given the previous header description, the following table can be generated:
0 2 Version 2 1 Padding 3 1 Extensions 4 4 ContributingCount 8 1 Marker 9 7 PayloadType 16 16 SequenceNumber 32 32 TimeStamp 64 32 SSRC
It is pretty printed here, but in reality, it’s just a standard Lua table where each entry is itself a table containing the essential information we need.
With that, it looks like everyting is in place to auto gen up some field accessors. Those are just simply assembled strings that look like this:
function CreateBufferFieldWriter(field)
return string.format([[
function set_%s(bytes, value)
setbitstobytes(bytes, %d, %d, value);
end
]], field.name, field.offset, field.size);
end
function CreateBufferFieldReader(field)
return string.format([[
function get_%s(bytes)
return getbitsfrombytes(bytes, %d, %d);
end
]], field.name, field.offset, field.size);
end
As a simple example, dealing with the Marker field would look like this:
function set_Marker(bytes, value)
setbitstobytes(bytes, 8, 1, value);
end
function get_Marker(bytes)
return getbitsfrombytes(bytes, 8, 1);
end
There is another function that just kind of ties all this together. The CreateBufferAccessor(desc) function takes our structure descriptor, and autogens all those appropriate accessors.
function CreateBufferAccessor(desc)
local funcs = {}
-- first create the offsets structure
local offsets = BitOffsetsFromTypeInfo(desc)
-- go through field by field and create the
-- bit of code that will write to the buffer
for _,field in ipairs(offsets) do
table.insert(funcs, CreateBufferFieldWriter(field))
table.insert(funcs, CreateBufferFieldReader(field))
end
return funcs
end
At this point we have a table that is full of strings that represent the functions that are used to set and get values on our buffer. And now, finally, to put it to use.
function test_BufferClass()
local funcs = CreateBufferAccessor(RTPHeader_Info)
local funcstr = table.concat(funcs)
-- Now that we have the functions, compile them
-- so we can try to use them
local f = loadstring(funcstr)
f()
-- Finally, try to set some values
-- Create a buffer first to act as the header storage
local buff = ffi.new("uint8_t[2048]")
set_Version(buff, 2)
set_Padding(buff,0)
set_Extensions(buff,1)
set_ContributingCount(buff, 3)
set_Marker(buff, 1)
set_PayloadType(buff, 15)
set_SequenceNumber(buff, 127)
set_TimeStamp(buff, 523)
set_SSRC(buff, 722)
print("Version: ", get_Version(buff))
print("Padding: ", get_Padding(buff))
print("Extensions: ", get_Extensions(buff))
print("Count: ", get_ContributingCount(buff))
print("Marker: ", get_Marker(buff))
print("Payload Type: ", get_PayloadType(buff))
print("Sequence: ", get_SequenceNumber(buff))
print("Timestamp: ", get_TimeStamp(buff))
print("SSRC: ", get_SSRC(buff))
end
The output of running this is:
Version: 2 Padding: 0 Extensions: 1 Count: 3 Marker: 1 Payload Type: 15 Sequence: 127 Timestamp: 523 SSRC: 722
And that’s exactly what I would have expected it to be. It appears that I can in fact round trip values into and out of the buffer. Now, if I were a true nutter, I’d benchmark this, and compare to the case where I’m consing up a structure and doing copying back and forth. But, my gut tells me I could better spend my time optimizing this technique and only optimize if it’s not meeting my needs. The easy of programming is just too tremendous to pass up.
I wrote Rtp code way long back, and it was a lot nastier looking with funny symbols such as ‘~^&|’ littered all over the place, with hard coded constants, and all manner of other bug inducing nasties laying about. With this version of things, I feel much more relaxed, and confident that the implementation is correct. Even when it is not correct, there are only a couple of places to look in order to change things. This also isolates all the endianness into one place as well, so I don’t have to deal with that anywhere in my higher level code.
There are a couple of nifty tricks here that are good from Lua. Since I can execute new code from a string on the fly at any time, I can construct the serialization code on the fly, and just execute it, without every having to actually have the code laying around anywhere in my environment. I know when using other technologies such as protocol buffers, it becomes a pain to change things because you do have to “compile” code, and distribute it. That makes for a very fragile system indeed. In the code here, the description can change, and can even come from another trusted source, and the serialization will just change automatically, as new serialization code can be constructed automatically. That’s a fairly powerful construct, and solves a very challenging maintencance problem. If you add versioning, you can deal with the versioning as well. Just use the proper serializer for the specific version of the header that comes in. It would only be brought in if a header of a new type is seen. Otherwise, you’re never bothered with it.
The last thing has to do with the convenience functions constructed. In the code here, I just created the functions in the global namespace. They could just as easily be stuffed into a list, or into a ‘class’. That’s the beauty of autogen’d code. It’s very small and easy to change to get a maximal effect.
Well, there you have it. I’m able to meet my goals. When I need to write out a new RtpHeader, I just take a buffer (probably from a circular queue), and write in the appropriate header values. Then, I can use the same buffer to fill in the payload information, and finally use the same buffer to write out to the network interface. Assuming the buffer was allocated from the system heap, and I have turned off send buffering in my networking code, this will go straight out without additional copying. And that’s very good indeed.
I like programming like this. I get to be fully lazy, assured that things will just work the way they are supposed to. I can read specs, and code them up fairly painlessly. The next step is to encode the actual protocol itself. It might be that Lua itself is the way to encapsulate the protocol description. No abstraction necessary. We’ll see.
Serialization 103 – Getting the Fiddly Bits
Posted: May 1, 2012 Filed under: LuaJIT, System Programming 1 Comment »Early in the evolution of computing, storage was a scarce commodity. As such, the earliest protocols and data structures packed things as tightly as possible. You can see this all the way down in machine instructions, but at a higher level, all you have to do is take a look at the layout of a modern protocol header:
typedef struct IPv6Header {
uint32_t version : 4;
uint32_t priority : 4;
uint32_t flowlabel : 24;
uint16_t payloadlength;
uint8_t nextheader;
uint8_t hoplimit;
uint8_t source[16];
uint8_t destination[16];
} IPv6Header;
This structure says that the field ‘version’ takes up the first 4 bits (0-3) of an int32, and priority the next 4 bits (4-7) and flowlabel takes the remaining 24 (8-31). If you were writing code in C, you’d naturally have access to these fields just as if they were integers, without any concern as to how they got stuffed into the structure. The compiler will just deal with the access.
Same goes for LuaJIT. For the most part, once you have declared such a CType, you can just access it, without a concern for how it was laid out in memory.
But how do you get the bits and bytes stuffed in there in the first place, if they’re coming off the wire?
Well, depending on how well behaved the data structure is, you might just be able to read the bytes off the wired, and copy them directly into a data structure of the appropriate type, and be done with it. Wash your hands, call it a day…
But, what about endianness? Is it least significant bit first, or most significant? Well now, you’ve gone and ruined your perfect day! As is often the case with networking protocols, the representation is “big endian”. No doubt a vestige of the machines dominant at the time of the birth of the internet (VAX? PDP?). Nowadays, we’re dominated by little endian machines (intel). So, there’s this little bit twiddling dance that often has to occur to get things in the right order. At least the internet specs are clear about endianness, at least for the ones that I’ve read. Images, not always so clear. Many times there is an implied bigendian, but it just depends on what architecture the software was first created for.
So, here I am in Lua, which barely has support for bit operations, and I need to do a lot of bit twiddling. Well, like any good programming effort, I start with the test harness. I need a couple of things. First, I want to be able to test whether a bit is on or off in a number, I also want to be able to set the value of a bit to on or off. Of course, with LuaJIT, the raw bitops are there. But, I can never rember how to count by two, and in particular how to deal with the offsets, and which symbol to use for the power, etc. First step then is to create the functions for my lazy programming:
function isset(value, bit)
return band(value, 2^bit) > 0
end
function setbit(value, bit)
return bor(value, 2^bit)
end
function clearbit(value, bit)
return bxor(value, 2^bit)
end
I’m sure these routines have been written 1000 times on numerous platforms, and probably in Lua as well. But, since they’re not in the standard Lua library, here are my versions.
The next thing I need is the ability to translate from binary string representation of a number to an actual number, and visa versa. To wit:
function numbertobinary(value, nbits)
nbits = nbits or 32
local res={}
for i=nbits-1,0,-1 do
if isset(value,i) then
table.insert(res, '1')
else
table.insert(res, '0')
end
end
return table.concat(res)
end
function binarytonumber(str)
local len = string.len(str)
local value = 0
for i=0,len-1 do
if str:sub(len-i,len-i) == '1' then
value = setbit(value, i)
end
end
return value
end
Perhaps not the most elegant of solutions, but they get the job done, and I can move on to more interesting parts of the bit twiddling.
function getbitsvalue(src, lowbit, bitcount)
lowbit = lowbit or 0
bitcount = bitcount or 32
local value = 0
for i=0,bitcount-1 do
value = bor(value, band(src, 2^(lowbit+i)))
end
return rshift(value,lowbit)
end
And a little test program to convince myself things are working properly:
function test_getbitsvalue()
print(getbitsvalue(0, 0, 8))
print(getbitsvalue(3, 0, 8))
print(getbitsvalue(6, 1, 8))
print(getbitsvalue(0xff, 0, 8))
local bin1 = "11000000"
local n1 = binarytonumber(bin1)
print("Binary 3: ", getbitsvalue(n1, 6, 2))
end
The test program shows how valuable it is to be able to create a human readable bit parttern and then turn that into a number that we can get bits out of. I can construct any manner of bit patterns, and then call on getbitsvalue (after converting to a number) and see if I get out what I expected to get out.
The astute reader will say “hay! you’re just doing little endian, wasn’t the point to be able to deal with big endian as well?”. So true. “left as an exercise for the reader”, but essentially, just reverse bits in the ‘getbitsvalue()’ routine. Throw in a check for big/little and you’re on your way. Although, this is a fairly low level routine, you have to check the specs of what it is that you’re trying to read off the wire. It might be a bitfield that is 4 bits. Are those 4 bits in little or big endian order? I don’t know, and neither does anyone else unless they read the spec. This is most applicable when you’ve got a set of bytes, and you go through the bytes one by one pulling out the bits that you want. If you’ve already read 4 bytes in as an int, then you have to think about whether you need to reverse things or not. You could use bit.swap() first, to get things in the proper order. But, close reading of the spec will tell you which way to go with these raw tools.
At any rate, armed with this little bit of twiddling, I can go back to my stream reader, and incorporate the reading of bitfields. That’s a big relief, because it means I will in fact be able to easily utilize funky protocols like IP to talk between both threads on a local machine, and across the internet.
The Existential Program
Posted: April 30, 2012 Filed under: Lua, LuaJIT, System Programming Leave a comment »In roughtly 1990, I wrote this program called “Who’s Calling?” for the Nascent NeXT computer. The program itself was one of those early day PIM type programs, you know, contacts, calendar, etc. The most interesting thing about it at the time was the ‘security’ mechanism I used to prevent random copying. Basically, it had this little routine it ran when you installed it. It would find a bit of unique information about your machine, and use that as key information to create a digest. Once installed, every time the program launched, it would check that digest, and try to decode based on the bit of information from the currently executing machine. If it got back what it expected, it would continue running.
It was a fairly simple process, that prevented casual copying. In the end, it was probably more trouble than it was worth, but while coming up with that little trick, I headed down this path of exploring identity. I asked myself questions such as “how can a program know when it has been tampered with”? “Does a program have an identity?”. At the same time, I was writing some collaboration apps, and other questions of authorization and authentication came up there as well. If I’m collaborating with several people in a session, who has the right to take over the pen? Does a new user requesting entrance into the collaboration have the right permissions to do so? How do I know who they are?
This was all pre-internet boom, so it was very interesting. Then the internet came along, and we went through a round of Berkeley style “free love, free information, free sharing…”, and then it all came crashing down.
But, now we find ourselves at the internet trough again, and those old questions of identity and authorization are coming back, but it a much more serious way. People are exchanging banking information, and they really don’t want that shared or used by unauthorized entities.
So, I find myself contemplating identity and authorization in a big way, and I find myself going back to the very same old questions. Does a program have an identity? What is the smallest unit of existance? As far as computers are concerned, I can tie this to the idea of the computicle. Does a computicle have an identity? The answer to me is yes, and in fact, this is part of what defines a computicle in the first place.
Looking back, I defined a computicle as having a few key traits:
- Ability to receive input
- Ability to perform some operations
- Ability to communicate with other computicles
There is an implied identity in here. You have to ask the question, “What has the Ability to…?”
You wouldn’t believe how this comes up in some of the most esoteric and mundane places. For example, once I create multiple cooperating threads in my super computer simulator, I have to answer the question “Who’s responsible for managing chunks of memory?”
The scenario is simple. Let’s say I have one computicle which does nothing more than generate keyboard/mouse events, and hands them off to other computicles to deal with. Every time it creates an event, it allocates a chunk of memory, to hand off to the other threads. Now, once that chunk of memory is created, who owns it? The keyboard/mouse generator could retain ownership, but for how long? Does a recipient have a guarantee as to when it might be freed or reused? Should the recipient copy it? How long do they have to copy it before it is stale? Can they share this chunk of memory with others, or what? It’s such a simple thing, but it seems so hard to deal with the combinatoric explosion.
Well, there are a certainly a few ways to deal with things. First, I can go back and look at memory allocation. I created a couple of simple heap structures a while back, and I recently revisited and updated them based on my work with the multiple threads. One of the questions I was trying to answer is, can the memory chunk deal with its own destiny? Said in less philosophical terms, can I use the GC system of Lua to manage the life of a chunk of memory in a reasonable and predictable way.
So, here’s a couple of data structures to deal with Heap allocated memory (Win32):
ffi.cdef[[
typedef struct {
void * Data;
int32_t Size;
HANDLE HeapHandle;
DWORD owningthread;
} HeapBlob;
typedef struct {
HANDLE Handle;
int Options;
size_t InitialSize;
size_t MaximumSize;
DWORD owningthread;
} HEAP;
]]
There are a couple of ways to allocate stuff in Lua. The regular way, without doing anything fancy, is to just create your thing, and start using it:
local myList = {}
myList.name = "William"
myList.job = "Programmer"
Eventually, once there are no references to your little list, it will magically disappear, being swept away by the garbage collection system.
Then, there’s allocating stuff using LuaJIT FFI:
local myChunk = ffi.new("char[256]")
Something allocated in this way will also be garbage collected, when the ‘myChunk’ variable no longer has any references to it. As long as you’re running with a single threaded environment, and you’re not passing your variables off to any other thread to deal with, this will work just fine. But, what if you need to have your chunk of memory last longer, even if you lose all references to it from the creating side.
The third way of creating a chunk of memory is to use a direct system memory allocation call. In the case of Windows, I could do the following:
local ptr = kernel32.HeapAlloc(self.Handle, flags, nbytes)
Ignoring for the moment where that ‘kernel32′ and self.Handle came from, essentially, you’ll get back a pointer to a chunk of memory that Windows knows about. Lua doesn’t know anything about this as a chunk of memory though. I could use ffi and tell the gc about it, and what function to call once ptr is no longer referenced, but really, if that is what you need, then you can just use the previous method of ffi.new(…).
So, here we have a lonely pointer, all allocated, and ready to use. I’d like to capture a bit of information about the creation event. Back to that existential programming, and back to the HeapBlob data structure.
I could do the following to assign the pointer to one of my HelpBlob data structures:
local blob = ffi.new("HeapBlob", ptr)
In this case, I have a HeapBlob struct, with the Data element initialized to the value the pointer held. So far, nothing has changed over the plain old ordinary ffi.new(), expect for the fact that I’ve just introduced a nice memory leak, wrapped up in an object. I need a bit more meat on the bones of the HeapBlob object. I want it to get in on the act of memory management. Specifically, when there are no more references to the object, I want the HeapBlob to deallocate the chunk of memory, thus, eliminating the memory leak:
So, I create a metatype to associate some functions with this simple data structure:
HeapBlob = nil
HeapBlob_mt = {
__gc = function(self)
if self.HeapHandle == nil or self.owningthread == 0 then return nil end
if self.owningthread == kernel32.GetCurrentThreadId() then
local success = kernel32.HeapFree(self.HeapHandle, 0, self.Data) ~= 0
end
end;
__tostring = function(self)
return string.format("Blob: Size: %d Thread: %d",
self.Size, self.owningthread);
end;
__index = {
GetSize = function(self, flags)
flags = flags or 0
if self.HeapHandle == nil then return 0 end
local result = kernel32.HeapSize(self.HeapHandle, flags, self.Data)
return result
end,
IsValid = function(self, flags)
flags = flags or 0
if self.HeapHandle == nil then return false end
local isValid = kernel32.HeapValidate(self.HeapHandle, flags, self.Data)
return isValid
end,
ReaAlloc = function(self, nbytes, flags)
flags = flags or 0
if self.Heap == nil then return false end
local ptr = kernel32.HeapReAlloc(self.HeapHandle, flags, self.Data, nbytes)
if ptr == nil then return false end
self.Data = ptr;
self.Size = nbytes;
return true
end,
}
}
HeapBlob = ffi.metatype("HeapBlob", HeapBlob_mt)
There’s a few functions in here, but the __gc function is the one that gets called whenever an instance of the HeapBlob goes out of scope, and has no further references to it. So, in the simplest case, you clould do the following:
HeapBlob(ptr)
And we’re back to having a handle on a piece of memory, and nothing will occur at __gc time, because we have not set an owningthreadid, and our self.HeapHandle is nil. Well, this is actually a good thing. This can be seen in a two thread scenario.
-- Thread 1 local ptr = Heap:Alloc(256) passMemoryPointerToThread(ptr, 256, takeownership=false, heaphandle=nil) -- Thread 2 local ptr = ReceiveMemoryPointerFromThread(ptr, size, takeownership, heaphandle) local threadid = 0 if takeownership then threaid = Sys.GetCurrentThreadId() end local blob = HeapBlob(ptr,size,heaphandle, 0, threadid)
In this scenario, thread 1 will retain responsibility for the allocated memory. It communicates this information first, by not giving the handle to the heap that was used to allocate the chunk of memory, and second by giving a flag indicating its desire. Really the witholding of the heaphandle is enough of a hint.
Thread 2 is free to go ahead and use the HeapBlob, pass it amongst its friends, etc. Then, when it’s no longer using the blob, the structure will be garbage collected, and nothing will happen to the bit of memory.
This HeapBlob could have additional information, such as a ref count, which can be incremented any time there is a new reference to it, but then it starts to look like a COM program.
If you want to pass along the Heap blob, and allow the second thread to take ownership, thread 1 would simply supply the heap handle. In that scenario, thread2 would have all the information it needs to manage the chunk of memory in ways that it deems appropriate, and it has received an explicit handoff from thread1 imploring it to do so.
This starts to give me better control of memory, which allows me to more strictly control the mechanisms between collaborating threads.
Getting back to Existence, if I can isolate and define one bit of existence, then I’m improving my ability to write fairly independent programs. I haven’t answered the question “who am I”, and “how am I different from another” as yet. But, at least now I have a way to communicate that I can control. I suspect answering the “who am I” thing will involve some GUID or other mechanisms, including ‘claims’ and possibly certificates. That will be an interesting exploration.
Serialization 102 – CodeGen
Posted: April 27, 2012 Filed under: Lua, LuaJIT, System Programming 1 Comment »Emboldened by the possibilities offered by simple BitRead/BitWrite, and a streaming interface, I’ve taken on the next step of the serialization process. Being the lazy error prone coder that I am, I’m looking for a way to have the machine do as much of the coding as possible. The machine is pretty dumb though. You have to very explicitly describe things to it before it will give you any help.
There are many ways to describe a data structure. One of the earlier stated design goals of Lua was for it to be a data description language. Tada! Since my primary goal is to stay within the confines of Lua as much as possible, it makes sense to use Lua itself to describe data structures.
My simplest example:
moveto = {
name = "MoveTo";
fields = {
{name="x", basetype="int32_t", ordinal=1},
{name="y", basetype="int32_t", ordinal=2},
};
};
This is intended to represent a command in my drawing system. Since I’m in Lua, I could just use this structure as it is, but it would be kind of clunky. This type of structure is what you might see with XML Schema and the like. Very verbose, lots of curly braces, very complete, but very unfriendly to the programmer.
But, my data structures don’t change much, so I don’t mind writing it in this form initially, because once in this form, I can easily transform it into other forms. If I want it to show up as a C structure, I have a function call for that:
cstruct = CStructFromTypeInfo(moveto )
print(cstruct);
typedef struct MoveTo {
int32_t x;
int32_t y;
} MoveTo;
Well, that’s convenient. If I wrap that in a ffi.cdev[[]], I will have a structure that I can pass back and forth between the two worlds. If I do:
MoveTo = ffi.typeof("MoveTo");
Then I’ll have a data structure that I can easily access from the Lua side as well. Then my code can easily look like:
myStruct = MoveTo(10,20); graphics:Deliver(myStruct);
Or what have you. But, this is about serialization isn’t it? Well, Serialization starts with a data definition. Depending on what system you’re on, that data is readily available, or it needs to be augmented in some way. In Lua, there really isn’t much of a data type system. Only a solitary type of number, string, null, table, bool, string, and that’s about it. So, you lose out on the other data types, and bitfields, and enums, date… If you were using C++, there’s the Runtime Type Information (RTTI), and that’s a big hairy beast that bloats your code, but gives you some type information at runtime. Similar for Objective-C. C# has a great type system, where you can get all information for any field, fairly easily.
Given enough type information, you should be able to serialize an object/structure, at least at a superficial level. So, what can I do with what I’ve defined here so far? Well, I don’t want to write the serializer, even though it’s only a couple lines of code, so I do the following:
local ser = CTypeSerializer(EMRTypes[EMR_MOVETOEX]) print(ser); function write_MoveTo_ToStream(stream, value) stream:WriteInt32(value.x); stream:WriteInt32(value.y); end
And similar for the deserializer. Of course, this is just a raw function to hydrate a structure from a stream. It is up to a ‘protocol’ to determine when this particular object is to be deserialized, vs any other. Typically, the protocol will place a marker, perhaps the name of the type, in the stream to be read out later.
This is great. I can even describe fairly complex structures with things like bitfields. Here’s a couple from the networking world:
IPv4Header_Info = {
name = "IPv4Header";
fields = {
{name = "version", basetype = "uint8_t", subtype="bit", repeating = 4, ordinal =1};
{name = "headerlength", basetype = "uint8_t", subtype="bit", repeating = 4, ordinal =2};
{name = "typeofservice", basetype = "uint8_t", repeating = 1, ordinal =3};
{name = "totallength", basetype = "uint16_t", repeating = 1, ordinal =4};
{name = "identification", basetype = "uint16_t", repeating = 1, ordinal =5};
{name = "blank", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =6};
{name = "DF", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =7};
{name = "MF", basetype = "uint16_t", subtype="bit", repeating = 1, ordinal =8};
{name = "fragmentoffset", basetype = "uint16_t", subtype="bit", repeating = 13, ordinal =9};
{name = "ttl", basetype = "uint8_t", repeating = 1, ordinal =10};
{name = "protocol", basetype = "uint8_t", repeating = 1, ordinal =11};
{name = "headerchecksum", basetype = "uint16_t", repeating = 1, ordinal =12};
{name = "source", basetype = "uint32_t", repeating = 1, ordinal =13};
{name = "destination", basetype = "uint32_t", repeating = 1, ordinal =14};
};
};
IPv6Header_Info = {
name = "IPv6Header";
fields = {
{name = "version", basetype = "uint32_t", subtype="bit", repeating = 4};
{name = "priority", basetype = "uint32_t", subtype="bit", repeating = 4};
{name = "flowlabel", basetype = "uint32_t", subtype="bit", repeating = 24};
{name = "payloadlength", basetype = "uint16_t"};
{name = "nextheader", basetype = "uint8_t"};
{name = "hoplimit", basetype = "uint8_t"};
{name = "source", basetype = "uint8_t", repeating = 16};
{name = "destination", basetype = "uint8_t", repeating = 16};
};
};
In the case of the Ipv4Header, I use all the bells and whistles (except enum). There are different types, bitfields, repeatings, and ordinals. I left out the “required” attribute, because the default is ‘true’. In the IPv6 case, I even left out the ordinals, as it assumes they are in the order specified (must iterate using ipairs). Depending on what you’re doing, even the field name might not be necessary. For example, if you’re implementing a stream processor, you just need to know the type, not the name.
The following data structures are created from these type info lists.
typedef struct IPv4Header {
uint8_t version : 4;
uint8_t headerlength : 4;
uint8_t typeofservice;
uint16_t totallength;
uint16_t identification;
uint16_t blank : 1;
uint16_t DF : 1;
uint16_t MF : 1;
uint16_t fragmentoffset : 13;
uint8_t ttl;
uint8_t protocol;
uint16_t headerchecksum;
uint32_t source;
uint32_t destination;
} IPv4Header;
typedef struct IPv6Header {
uint32_t version : 4;
uint32_t priority : 4;
uint32_t flowlabel : 24;
uint16_t payloadlength;
uint8_t nextheader;
uint8_t hoplimit;
uint8_t source[16];
uint8_t destination[16];
} IPv6Header;
And of course, their equivalent (de)serializers can be auto generated in the same way the first one was.
The amount of code needed to auto gen the structure, or the serializers is trivial. That means, it’s just as easy to do the same for any language you care to use. Many serialization schemes in use today use this mechanism. It’s quick and dirty, but it’s a bit bloated, and fragile when it comes to versioning things.
Google has this serialization technology called Protocol Buffers. Their flavor for communicating with things. I found this one project (came from within Google apparently) where they do streaming based Protocol Buffer parsing, instead of filling up existant objects. The upb project follows the same methodology as the old SAX XML parser. That is, register callbacks, and when the parser sees something in the stream, it calls your callback, and you do whatever you want with the results. Thus, the protocol does not force you to use structures, if that’s not appropriate for your usage pattern.
In my case here, I could utilize something like a upb parser, and either fill my structures, or not. The code is trivial to change, so it’s not a design choice I have to make up front. I can simply do what’s natural for whatever the situation is.
With these mechanisms in hand, it starts to become interesting as to what can be done automatically. These are just raw structures though. Typically, when trying to deserialize something, like an image, things get more complicated. The data structure alone is not enough to tell you how to get stuff out of the stream. If there’s compression involved, for example, you need to know about that and do something. So, this technique is useful, and it’s a good downpayment on more complex mechanisms, but it’s just the beginning.
This is good enough though for most typical communications between threads in a multi-threaded program. Now I can trivially create some “commands”, serialize them into memory packets, and send those packets between threads. A no brainer!
Serialization 101
Posted: April 26, 2012 Filed under: Lua, LuaJIT 1 Comment »As soon as you want to communicate with something, you have to worry about how you’ll represent your data. If you lookup binary serialization on the inter webs, you’ll get a laundry list full of history and methodologies. For a few years, I had the pleasure of pushing XML as the end all be all data representation format. Nowadays, JSON is very popular, because it matches JavaScript, which is very popular.
The basic problem is, I have data on my machine represented in some machine readable form, and I want to communicate it somewhere else, possibly on a different architecture, and probably across the internet.
In the case of binary data, there is always the need for a very low level serializer that can deal with things like reading and writing simple types, such as bytes, int, float, double, short, string. This lowest level thing takes care of endianness, and reliably reading and writing to the native format of the machine. I have implemented BitReader, and BitWriter classes. They deal with byte arrays, so they are fairly agnostic as to how thos byte arrays came to be.
BitReader ReadByte(bytes) ReadInt16(bytes) ReadUInt16(bytes) ReadInt32(bytes) ReadUInt32(bytes) ReadInt64(bytes) ReadSingle(bytes) ReadDouble(bytes) BitWriter WriteByte(bytes, value) WriteInt16(bytes, value) WriteUInt16(bytes, value) WriteInt32(bytes, value) WriteUInt32(bytes, value) WriteInt64(bytes, value) WriteUInt64(bytes, value) WriteSingle(bytes, value) WriteDouble(bytes, value)
This is all in the name of convenience of course. You might notice that there is no Read/WriteString() methods. That’s because a string does not have a universal representation as a data type. Is it a list of ASCII characters followed by a terminating null, or is it UTF-8 encoded, or is it like a Pascal string with a leading length, followed by ASCII or something else? So, at this very lowest level, there is no concept of a string. But, you do want to deal with strings, and it comes at the next level up.
Bit stuffing and unstuffing is very low level and fundamental. At a higher level, you need to deal with the actual buffers, and move pointers along in the buffer that you’re reading and writing. This is where the BinaryStreamReader/Writer comes in.
Similar to the low level bit reader/writer pair, these stream versions add some convenience. Here is an example of a reader and writer in action:
function test_Int()
local len = 1024;
local bytes = Array1D(len, "uint8_t");
local writer = BinaryStreamWriter.CreateForBytes(bytes,len);
writer:WriteInt16(32);
writer:WriteInt32(958);
writer:WriteInt16(2301);
writer:WriteInt32(23);
local reader = BinaryStreamReader.CreateForBytes(bytes, len);
assert(reader:ReadInt16() == 32);
assert(reader:ReadInt32() == 958);
assert(reader:ReadInt16() == 2301);
assert(reader:ReadInt32() == 23);
end
First, I create a buffer, using Array1D, of a sufficient size. Then I create a binary stream writer on top of the buffer. The writer will just keep track of where we are, and whether we’ve written off the end of the buffer. Every call to Write, will essentially pass through to the BitWriter.
Same thing is true of the binary stream reader. As these are separate functions, you could actually create both a reader and a writer on the same buffer, at the same time, with no problem, other than worrying about whether you’re overwriting something you don’t want to.
This is where the convenience methods for strings are implemented as well:
function test_string()
local len = 1024;
local bytes = Array1D(len, "uint8_t");
local writer = BinaryStreamWriter.CreateForBytes(bytes,len);
writer:WriteString("this is a whole long string that I want");
local reader = BinaryStreamReader.CreateForBytes(bytes, len);
local str = reader:ReadString();
print(str);
end
I was actually able to use this in my Targa loading code like this:
fHeader.IDLength = reader:ReadByte();
fHeader.ColorMapType = reader:ReadByte();
fHeader.ImageType = reader:ReadByte();
fHeader.CMapStart = reader:ReadInt16();
fHeader.CMapLength = reader:ReadInt16();
fHeader.CMapDepth = reader:ReadByte();
-- Image description
fHeader.XOffset = reader:ReadInt16();
fHeader.YOffset = reader:ReadInt16();
fHeader.Width = reader:ReadUInt16();
fHeader.Height = reader:ReadUInt16();
fHeader.PixelDepth = reader:ReadByte();
fHeader.ImageDescriptor = reader:ReadByte();
It may not look like much, and if you’re a user of the .net frameworks, you’ll be doing a big yawn about now, because this is the same pattern that’s existed in that world since roughly 2001. But, this is Lua, and the standard libraries don’t support this, so here it is. There are various libraries that support one or another of the various serialization schemes, but they are typically meant to support that scheme only, and don’t necessarily generalize the binary reading and writing aspect of things.
Well, now that this little tool is in hand, other things can come. When I have data to send somewhere, it’s a fairly easy matter to write some serialization code that will take my various attributes, and use the binary stream writer to stuff them into a chunk of memory. Once there, it looks like a payload, and it can be delivered to any other interface that accepts chunks of memory. From the last article, I’ve shown that it can be a separate thread that receives this payload, or it could be a machine across the net. Once received, the binary stream reader is used to unpack stuff, and on you go.
The serialization process can be tedious, and when implemented by hand, error prone. Many frameworks try to help by writing serializers based on an abstract description of the data structure. I’m thinking, with Lua, I don’t need another format to represent data structures. I don’t need to resort to XML, ASN.1, JSON or anything else. I can just describe things in Lua, either using a table, with names and and data types, or some other mechanism. From the same description, I can generate a serializer on the fly, and get it JIT’d, which will be a very good thing indeed. I could also generate serializers for other languages fairly easily if needs be.
At any rate, armed with this tool, putting together things like protocol headers becomes a snap. Then, doing something like a remote desktop app, where you’re sharing bits of a screen, also become routine. You just construct payloads big enough to hold your screen bits, and send them away.
Once things are in payloads, it becomes relatively easy to do other things such as compression, and encryption. That might be useful if youi’re trying to construct a secured communications channel. But, I digress…
Super Computing Simulation Sure is Simple
Posted: April 26, 2012 Filed under: Lua, LuaJIT Leave a comment »A few years ago I had this idea about computing. I thought, what is the smallest unit of compute power? Then I thought about trying to define exactly what I meant by this, and how it might be rendered in reality. Thus was coined the phrase “Computicle”. Basically, a particle of computation.
A computicle is a basic unit of computation, not defined by NAND gates and transistors, but by pure definition of words and mechanisms. I have come to believe the essential attributes of a computicle are the following:
- Ability to receive input
- Ability to perform some operations
- Ability to communicate with other computicles
That’s it, and after reading it, it doesn’t sound like it’s worth much of anything. But, it’s amazing how hard it is to clearly define a small group of words precisely enough that they can be used to express a large variety of outcomes. Look at DNA for instance. There are the 4 base pairs, very simple. And there are rules for which combine with which, again very simple. Given those simple rules, and the soup that life stuff is made of and you can express everything from an ant to a zebra, and everything else in between, including plants.
So, do these computicle attributes buy me anything? Well, I’ll take a step back, and look at why I’d even bother pondering such things in the first place.
Increasingly, putting together a program with any level of complexity is getting harder and harder. I used to be able to program in assembly, way back when it was 6502, or 8088. I am completely bewildered by today’s machines, and would much rather rely on a compiler to do it for me. Putting together a hard core web service is a very challenging task. There are tons of little pieces, and lots of ways to do it. Even so, it is at that 6502 assembly stage. The only problem is, the machine has not been completely defined as yet. We have bits and pieces such as TCP, JSON, HTTP, storage, and compute. But, they don’t all flow into a coherent “machine” that can be reasoned about and easily programmed. As such, there are no ways to really express a web service, such that a compiler could come along and optimize the heck out of it. We’re all programming in assembly. Even worse, the assembly is specific to the particular platform and service being deployed. This is back to the cores and tubes days of early computing.
So, rolling forward, I’d like to use the constraints of the computicles to help me define what it is to create a massively parallel thing. I’d first like to be able to express the “machine” that I am programming, and then use computicles to express that machine in real terms, and then apply a program to that machine.
Well, that’s kind of lofty stuff. So, I’d better start much smaller. I have previously played with running Lua instances in separate threads. One of the things that I left as an exercise for the reader was the ability to communicate between those threads. Recently I’ve baked thread support into HeadsUp, and gone the next couple of steps to be able to easily create a thread that runs some lua script. And, the next step is showing how to actually talk to such a running thread.
First of all, within HeadsUp, there is some code to run a lua script. It is in fact the same code that HeadsUp itself uses to run a script. It’s just exposed, and has a signature such that it can be used as a startup routine for the CreateThread call. The signature looks like this:
DllExport int RunLuaScript(void *s);
Great. Now, I already had this object, LuaScriptThread. This object takes a script, in the form of a lua string, and creates a thread, and sends the script to RunLuaScript() to be executed. This is the easiest way of running a bit of Lua script in a separate thread from whatever thread you so happen to be calling from.
Here is an instance of doing such a thing:
local ffi = require "ffi";
require "LuaScriptThread"
local user32 = ffi.load("user32");
local kernel32 = ffi.load("kernel32");
local path = ffi.new("char["..(string.len(package.path)+1).."]");
ffi.copy(path, ffi.cast("char *",package.path), string.len(package.path));
function testLooper()
local looper = LuaScriptThread(simplechunk, path);
-- Give the thread a chance to start
kernel32.Sleep(1000);
local maxIterations = 0xffff;
local counter = 1;
local bRet;
cmd = C.WM_COMMAND;
while (true) do
if counter > maxIterations then
looper:Quit();
return ;
end
bRet = user32.PostThreadMessageA(looper.ThreadId,cmd,counter,0);
if bRet == 0 then
local err = C.GetLastError();
print(string.format("Error: 0x%x", err));
kernel32.Sleep(1000);
end
counter = counter + 1;
end
end
testLooper();
There are three parts to this piece of code. The first part is simply concerned with getting the right modules loaded, and creating a copy of the current inclusion path, so it can be passed along to the thread when it starts.
The second part of the program is the creation of the thread of execution:
local looper = LuaScriptThread(simplechunk, path);
As soon as this line is executed, the OS is creating a separate thread to execute whatever Lua script code is indicated by “simplechunk”. The ‘path’ parameter is being passed along as the single argument to the code. Really, this could be anything at all, including a whole large Lua program in and of itself. This thread startup, and the parameter being passed go hand in hand. Usually, whomever creeates the threads, knows what parameter to pass to the thread on startup. It’s generally a good idea to at least pass the current path, but it completely depends on the code that is to run in the thread.
The last part of this code snippet is concerned with sending messages to the newly created thread. In this case, using PostThreadMessage() will do the trick. On Windows, any thread the starts calling “GetMessage()”, will have a message queue associated with it. This is typically done with apps that have Windows associated with them, but that’s not strictly required. it used to be that you had to create this fake window to get this queue, but these days, it seems to work without having to create that fake window.
PostThreadMessage() is a non-blocking call. You just call it, and you’re done. Whatever you sent will be pulled out of the queue by the thread routine, and they’ll do whatever they want with it.
So, how about that thread routine?
local simplechunk = [[
local ffi = require("ffi");
local path = ffi.string(ffi.cast("char *",tonumber(_ThreadParam)))
package.path = path;
require("win_user32");
local user32 = ffi.load("User32")
require("MessagePrinter")
printer = MessagePrinter();
local msg = ffi.new("MSG")
local bRet = user32.GetMessageA(msg,nil,0,0);
while (bRet ~= 0) do
printer:Receive(msg);
bRet = user32.GetMessageA(msg,nil,0,0);
end
]];
Again, represented by a couple of parts. The first part deals with getting the path, which was passed as a parameter (which shows up as _ThreadParam). That parameter is passed as a string representation of a pointer to the data which represents the argument. It has to be turned back into a number, and then cast to a char *, which can then be converted to a string, and then we can set the package.path. This allows us to then load scripts, just like the original program did.
There is one thing to note about this thread instance of the Lua environment. Unlike the HeadsUp Lua environment, this environment does not have anything but the standard libraries loaded into it, not even the core stuff like class, and the like. So, anything you want to use, you have to “require” or ‘ffi.load’ to get.
But, this is a beach head. From this stub, you can do anything, including recursively creating even more threads. In fact, this is a good way for a “service” to make instances of stuff. But, I digress.
The second half of the program contains the GetMessageA() loop. In this case, I’m doing something very simple. Just get the message, and print it out. The message printer is implemented in a class in a separate file, which is ‘require(d)’ to get it in:
require "BanateCore"
class.MessagePrinter()
function MessagePrinter:_init()
end
function MessagePrinter:Receive(msg)
print(string.format("Message: 0x%x wParam: 0x%x lParam: 0x%x",
msg.message, msg.wParam, msg.lParam));
end
Again, fairly brain dead simple. And here, I start to see the constraints of computicles expressing themselves. The MessagePrinter might be a computicle. It receives data, and does something.
The thread routine is the same, it receives data, and it does something.
One thing that would help here is to standardize the “receives data” part of things. In this example, I’m using two different mechanisms. In one, I’m using PostThreadMessage() to send data to a computicle. In another, I’m calling the “Receive()” method on an object. In both cases, I’m essentially communicating the same information, just using different communications mechanisms.
Well, that’s easily standarizable. Let’s first encapsulate the data in some payload, which could be defined as:
typedef struct {
int size;
char *data[1];
}payload;
Or something like that. The idea being, whatever data you’ve got, you can pack into a data structure, which tells how big the data is.
After you’ve stuffed the buffer using whatever means available, you’re ready to send the payload off to somewhere. Well, “To Somewhere” could mean many different things. It could be a routine running on a thread, in a process, on an OS, on a machine, in a vault, in the US, or it could be something across the internet. Clearly, some form of addressing needs to occur. The internet has IPV4, but that only gets you halfway there, depending, and that’s the wrong kind of address when I’m communicating with threads within the same process.
Hmmm, what to do… Well, there needs to be an addressing mechanism, so I’ll just wish it into existance. Now, I need a message dispatch mechanism. The message dispatcher will take the address, and the payload, and ensure the payload gets to the address. So, the generalized communications flow looks like:
construct payload get address of recipient deliver payload to recipient [rinse repeat]
Pretty simple in theory, and actually, pretty simple in implementation as well I think.
Now, an interesting thing begins to happen to my program. Payloads can represent protocols. There’s no reason the protocol has to be connected to the transport mechanism… I can talk TCP/IP between threads in a process, without involving the networking stack in any way. As long as I have a payload packer that knows how to pack TCP/IP packets, and a transport that knows how to talk TCP/IP, it can work.
Hmmm, that’s interesting. I can easily implement “Ping” between two threads. Just as easily, or perhaps easier, than I can do it between internet nodes. Well, isn’t that cheerful! And if I can implement these low level protocols, then I can implement higher level protocols as well, because they simply layer atop these lower ones.
OK. Now to the sweet Jesus moment. If I’ve got these simple computicles, which can communicate using a variety of mechanisms, then I can simulate my distributed cloud based service? Why yes, and why not? It’s a royal pain to try and simulate a distributed services that is based in 8 different countries around the world, without actually deploying it for real. I’d like to be able to simulate the latencies, redundant packets, packet loss, down links, and all the rest. Well, with my little computicles, I might begin to have a chance.
Similarly, I want to simulate a deployment of a large scale services. Let’s say 100 nodes, even spread accross 8 data centers. Then I want to do a live update, and take some nodes down in the process.
Oh, and while I’m at it, of course I want to be able to visualize what the heck is going on without much fuss.
Is this total fantasy? Well, at least I’ve nailed down the running of Lua script in a thread, and communicating with it thing. Now it’s a matter of implementing some more intelligent computicles to string together a grammar which can express some machine configurations. Then I’ll be cooking with gas.
Why is there so much muck in my program???
Posted: April 25, 2012 Filed under: LuaJIT Leave a comment »In order to do some really good programming, the kind where you get to focus on your core competencies, requires that you have libraries and frameworks that are very easy to use, and hit the spot just right in terms of complexity and power.
Since I have been doing a lot of interop to various libraries, I’m seeing first hand how complex this can be, and how evil and wrong some libraries can be, and how beautifully simple others are. One of the conclusions I have come to, is for modern day programming, a library that has a brain dead straight simple C interface is the best possible way to go. Not all C interfaces are created equally though. The ones I like typically look like this:
int getlasterror(); context_t ctx = create_context(params); func1(context_t ctx); func2(context_t ctc, int param1);
That is, create some sort of handle, that the library of routines understands, and will do something with. Then, call various routines in the library, passing in the context handle that you were given by the library in the beginning.
Somewhat similar, but extremely unsatisfactory is this pattern:
context_t ctx; int err = create_context(&ctx);
I don’t like this because I have to come up with a mechanism to pass in a pointer to be filled in. Well, depending on the language that I’m using, passing in a pointer to something may or may not be possible, so it’s just a couple bits harder to do interop with than the first mechanism. The one positive tradeoff here is that the function can have a return value that you can check immediately, instead of having to call something like “getlasterror()” that the first mechanism utilizes. This is useful in an environment where youi might be doing multi-threading, but since I can handle multi-threading in other ways, this is not a big bonus for me.
The Windows COM environment is full of this second mechanism, only it goes even further, which brings me to the next pattern.
objectinterface *objectptr; int err = getobject(&objectptr); *objectPtr->vtbl->func(params);
In this case, the library tries to do too many things on my behalf. It’s trying to return an “object” to me, which is a structure with a ‘vtable’, which has pointers to functions, which I can then call. This is just plain hard and ugly. I might not be able to deal with these function pointers from my language, and that passing in a pointer to a pointer thing is always kind of problematic, even from C (at least for me). In this case, I’d actually be happy with using the first method, and creating my own class construct appropriately for my preferred language. In the case of Lua, this would be a snap, as LuaJIT FFI will give me ready access to the function signatures, and I can create a “class” fairly easily using a table. By trying to use this last method instead, I’m just in for a world of hurt, pain and suffering, and will spend most of my time doing interop rather than actually using the library in question.
But, all is not well, just because you have a straight C interface to your library. I was recently trying to get the OpenGL interop working “perfectly”. Basically, I wanted to have the option of constructing a GL context for any given version, and being specific about the pixel attributes and other things.
Well, GL has this funky dance you have to do in order to create a proper context. That dance involves creating a fake window, a fake context, getting pixel attributes, creating your real window, setting pixel format on that window, destroying the old fake window, getting the GL context. All told, it’s about 100 lines of convoluted code, that is very finicky, and sensitive to the order in which things are done. There is esoterica at many levels as depending on which version of a GL context you’re trying to construct, things get very hairy as to defaults, what behaviors you’ll get, etc.
But, now that the fragile mess is done, I can forget about that part and just continue using OpenGL happily. Luckily, OpenGL interfaces are of the first variety. I am free to create my own class constructs where I feel I need them, and don’t bother in places where I don’t. Thank goodness, otherwise, I’d hate to think how life would have been if I had to swallow the whole of OpenInventor just to get some triangles displayed on the screen.
I can only hope that anyone creating a modern library that is meant to be used by dynamic languages, will fight the urge to provide too much, and will just go back to some basic C roots, as that truly is a universal and simple base to start from.
Unchaining the GPU with Lua and OpenCL
Posted: April 23, 2012 Filed under: HeadsUp, Lua, LuaJIT, OpenCL, OpenGL Leave a comment »Quite a few years ago, I programmed the BeBox to display multiple streams of .mpg video, while simultaneously pulling in video feeds from Satellite and cable. In all, you could see snapshots of roughly six things on the screen, happening all at the same time.
The CPUs were utilized primarily for the mpeg part, doing decoding, and some special effects when changing sources being displayed in the primary area. The feeds coming off the Happauge video capture card were being DMAd directly into the framebuffer of the graphics card, so there wasn’t any work by the CPU going on there.
That was a pretty good result for a dual-proc machine circa 1996. That was at the very beginning of the birth of nVidia, and GPUs were actually first becoming mainstream from 3dfx. Roll forward 16 years… and where are we today?
Well, the machine whining away under my desk is a 3.4Ghz AMD Phenom(tm) II X4 965 Processor, with 8Gb of RAM. The graphics card is an nVidia gfx 275. This machine is a couple years old now, but compared to that BBox, it’s a monster from another planet. As such, you would think it would be able to perform the same feats as that old machine, without even heating up a single resistor. To make it even more of a monster, there’s that GPU sitting in there which has 1000 times over the amount of processing power utilized to send people to the moon in the sixties.
So, what can this machine do? Well, It allows me to type really fast!! I can read emails in the blink of an eye, and Netflix movies play with nary a stutter! I tell you, it’s simply amazing! But, what about all that horsepower that’s sitting idle under my desk? Surely I can put it to some good usage.
Well, of course graphics processing can largely be offloaded to the GPU these days. Although I conjured up a graphics library that lives complely on the CPU, and just draws to memory, doing the same using the GPU is far faster, and takes a lot less electricity.
And finally, I come to the point. I have gotten far enough along with my OpenCL binding that I can now actually do some OpenCL programming. OpenCL is an interesting little thing. Basically, it introduces the concept of ‘kernel’ programming. And here, Kernel does not mean the OS kernel, but rather the small little bit of code that will run in parallel on the same piece of memory that other little bits of code are running against. This is in fact what happens when you’re running a GLSL shader. It’s just a little ‘kernel’, and in the case of a fragment shader, that little kernel runs against all the pixels in a frame, in parallel with hundreds of others doing the same thing.
Using GLSL based fragment shaders is great for graphics programming, but for general computing, it’s kind of clunky as you’d have to cast your compute problem into terms that the graphics pipeline can understand. Furthermore, in order to use GLSL at all, you have to do things like create a GLContext, which requires a DeviceContext, which requires a Window, or at least a GDIBitmap. That’s a lot of machiner to just write a bit of code to manipulate some data.
OpenCL changes things a bit. First of all, you have access to the GPU power without the graphics constructs. You still have to create a proper context, but it’s far easier without having to worry about windows and bitmaps. There are some concepts, and a hierarchy for doing things. You start at the top with platforms. There may be multiple “platforms” within your machine. Usually there is only one though. Within a platform, there are devices. There may be multiple devices in a platform. For example, you might have two nVidia cards in your machine, and that will list as two devices.
After the device, there is the concept of a context. The context can span multiple devices. The context controls things like where memory is created, where programs are created, where kernels are run, and the like. This is really where things start to get interesting.
From the context, you can create a “program”. Here, I think it is easier to think of the program as “image”. You are essentially placing an “image” onto the context. I think of the image as the raw OS image, ready to have little bits of code running in it.
Then, finally, you can create a “kernel”, which is actually a piece of code that’s going to execute on the device.
That’s a lot of stuff, and a lot of error checking, and a lot of pointers that can go wrong, etc. So, the Lua version looks like this:
local platform, num = CLGetPlatform() local devices = platform:GetDevices(CL_DEVICE_TYPE_GPU) runkernel(devices[1]);
That is, get the first plaform available. Then, get the list of devices available on the platform. And finally, run a kernel (code below).
Using Lua is nice because garbage collection can be used to release various resources when they’re no longer in use. That saves a bit of typing, and you don’t have to remember anything.
To run a kernel, I looked at a simple example in C, written by Clifford Wolf.
local program_source = [[
__kernel void simple_demo(__global int *src, __global int *dst, int factor)
{
int i = get_global_id(0);
dst[i] = src[i] * factor;
}
]];
function runkernel(device)
local context = CLContext():CreateForDevice(device);
local program = context:CreateProgramFromSource(program_source);
program:Build();
local NUM_DATA = 100;
local buffsize = ffi.sizeof("int")*NUM_DATA;
local input_buffer = context:CreateBuffer(buffsize, CL_MEM_READ_ONLY);
local output_buffer = context:CreateBuffer(buffsize, CL_MEM_WRITE_ONLY);
local factor = 2;
local lpfactor = ffi.new("int[1]", factor);
local kernel = program:CreateKernel("simple_demo");
kernel:SetIndexedArg(0, input_buffer.Handle, ffi.sizeof("cl_mem"));
kernel:SetIndexedArg(1, output_buffer.Handle, ffi.sizeof("cl_mem"));
kernel:SetIndexedArg(2, lpfactor, ffi.sizeof("int"));
local queue = context:CreateCommandQueue(input_buffer);
local intsize = ffi.sizeof("int");
local lpi = ffi.new("int[1]");
for i=0, NUM_DATA-1 do
local offset = intsize*i;
lpi[0] = i;
queue:EnqueueWriteBuffer(input_buffer, offset, lpi, intsize);
end
local global_work_size = ffi.new("size_t[1]",NUM_DATA);
local kernel_completion = queue:EnqueueNDRangeKernel(kernel, global_work_size);
kernel_completion:Wait();
kernel_completion:Release();
print("Result:");
local lpdata = ffi.new("int[1]");
for i=0, NUM_DATA-1 do
local offset = i*intsize;
local err = ocl.clEnqueueReadBuffer(queue.Handle, output_buffer.Handle,
CL_TRUE, offset, intsize, lpdata, 0, nil, nil);
CL_CHECK(err, "clEnqueueReadBuffer");
print(lpdata[0]);
end
end
In the first part of runkernel(), I’m using the nice object like interface that the Lua binding provides. In the last part of the function, I’m using the straight OpenCL calls, just to show how that’s done.
There are a couple of things of note here. First, the ‘program_source’ is just a string. This is the same as with GLSLProgram. There are various environments available, including from nVidia, which will help you create these kernel strings. Once you have your string perfected, you can just drop it in for inclusion as your kernel.
Since a kernel is not a function in lua that you can just pass variables to, you have to do some explicit work to pass values in as arguments. kernel:SetIndexedArg() performs this task. This is an ideal candidate for some Lua magic to make it simpler. Unlike the GLSL interface, I can’t query the program to find out the types of the various arguments. But, since I wrote the kernel, I do know their types, so, I write a little table that maps the index to a name, and the data values, and this code could turn into a more familiar:
kernel.src = input_buffer kernel.dst = output_buffer kernel.factor = 2
Then I’d be happy as a clam. There is another concept that gets in your face here. That’s the whole queuewrite, queueread business. Basically, all data and kernel deployment happens as commands executed from a queue. That fact does not need to be front and center, and a little bit of wrapping might make it nicer to deal with.
Now that this is in hand, what can be done with it? Well, there’s the obvious graphics stuff, which is where it came from, but there’s a whole lot more. I was just thinking that this might be a great way to perform base64 encoding for example. It’s a largely parallel task. You could write a kernel that turns a 3-character block into the equivalent 4-character code. As this kernel can run in parallel, you could literally have hundreds of them working on encoding your text at the same time. At the end, you’ve got a base64 encoded thing, in a fraction of the time it would normally take.
Using a slightly different approach, that of stream processing, you could probably perform some cryptographic operations, like digest calculations and the like.
There is one tool that I found that makes exploring OpenCL fairly easy and fun. OpenCL Studio is done by Geist Software Labs, who appear to be a consultancy for high performance computing. They have a nice Lua scriptable environment that allows you to play with OpenCL and OpenGL, just like that.
Having such a tool available is an accelerant for me to get even more productivity wrung out of myself, and my machine.
With my little Lua Binding to OpenCL, I am confident that I’m going to be able to get more per killowatt out of my programming. That’s good for my programs, and good for the environment. I’m hoping that between a fast quad-proc, super duper graphics card, and Lua, I’ll finally be able to write and utilize programs that are more impressive that what I could do 15 years ago.