Loading Multiple Lua States

Here’s the thing.  Ultimately I want to use I/O completion ports on Windows to create highly scalable networking infrastructure.  I don’t want to write a lick of C code, I want to stay completely within the confines of Lua.  So what to do?  Well, first off, when you work with IOCP, you need to assign some threads to deal with the completion of various IO operations.  OK.  Easy enough, with LuaJIT, just create the interop function to the CreateThread() system call.  But wait, I want to execute some lua code in that thread…

Alrighty then, why don’t I just pass in the lua state that I have now and…

Never mind.  What I really need to do is create a thread, and within that thread create a lua state object, and have that object execute the little bit of code that is needed.  With LuaJIT, this is totally possible.  Normally, when you create a lua_State, you are writing code in standard ‘C’, or whatever environment.  But with LuaJIT, the lua51.dll is just as accessible as any other library in the world, so you can simply access it and create your state as you normally would in C.

But, there’s a rub, first you need that massive ffi definitions file that mimics the appropriate .h files.  And so, first I had to create Luaffi.lua, which is part of the BanateCoreWin32 files.  What is this file?  It is basically an amalgamation of the various header files that are used to create Lua.   Namely, it includes the contents of: luaconf.h, lua.h, lauxlib.h, and lualib.h

This was a fairly mindless task of copying over the appropriate part, deciding whether a #define was a simple alias to something, a function, or a constant, and putting ffi.cdef[[]] around all the appropriate functions.  It compiles cleanly, but that does not mean everything actually works correctly.  I’ll have to go through it a few times to ensure everything is actually correct.

There is one big Warning in there.  There is a constant that comes from stdio: BUFSIZ

On Windows, this is defined as 512, so I explicitly set the value to 512.  This value is completely dependent on your system, and should be set appropriately.  Ideally, this would be a value that could be queried in the system, because this is the most fragile bit of this interop.  It’s not used in too many places, but when it is, it will likely break things.

And so, what do you get for your troubles?

local ffi = require "ffi"
local lua = require "Luaffi"

-- lua_tostring, lua_pop, and luaL_dostring are macros in the C headers;
-- Luaffi.lua supplies them as plain functions, hence no lua. prefix.
function report_errors(L, status)
    if status ~= 0 then
        print("-- ", ffi.string(lua_tostring(L, -1)))
        lua_pop(L, 1)  -- remove error message
    end
end

function execlua(codechunk)
    local L = lua.luaL_newstate()  -- create state

    if L == nil then
        print("cannot create state: not enough memory")
        return EXIT_FAILURE
    end

    -- Load whatever libraries are necessary for
    -- your code to start
    print("luaopen_base: ", lua.luaopen_base(L))
    print("luaopen_io: ", lua.luaopen_io(L))
    print("luaopen_table: ", lua.luaopen_table(L))
    print("luaopen_string: ", lua.luaopen_string(L))
    print("luaopen_math: ", lua.luaopen_math(L))
    print("luaopen_bit: ", lua.luaopen_bit(L))
    print("luaopen_jit: ", lua.luaopen_jit(L))
    print("luaopen_ffi: ", lua.luaopen_ffi(L))

    -- execute the given code chunk
    local status = luaL_dostring(L, codechunk)
    print("Result: ", status)

    report_errors(L, status)

    lua.lua_close(L)  -- tear the state back down
end

execlua("print('hello lua')")

That’s a fairly standard looking “main()” for using Lua.  Basically, do everything within LuaJIT itself, without having to use a lick of C code.  Now that I have this basic capability, the rest of the task is fairly straightforward.

One challenge ahead is how to communicate between threads.  Well, I’m a big believer in message passing.  From years of doing multi-threaded, multi-processor programming, I know full well that I’m not good at maintaining shared memory state.  Over the years, I have found that the best way to keep things straight is to simply pass messages.  Granted, debugging an asynchronous message passing system is no walk in the park either, but it makes for much more easily scalable systems.  By using message passing, you can focus on ensuring that the messaging mechanism between processes works correctly, and forget the rest.  This style also lends itself easily to being distributed, either across processes, or across the internet, which is a good thing.

One way to pass messages between threads on Windows is to use the PostThreadMessage() function.  Each thread will be running in a little message loop, and when a message comes in, it can be placed on a queue within the lua state, and the executing code (which should be in a coroutine) can pull it out and deal with it at its leisure.

Of course, to extend more broadly, PostThreadMessage can be aliased with something more interesting, like PostIPNode, or whatever.  As long as the function can take a Blob and send it to its destination, I shouldn’t really care.
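The receiving side of that scheme can be sketched in plain Lua.  This is a hypothetical shape, not code from BanateCore: the MessageQueue name is my own, and the actual delivery (PostThreadMessage into a message loop) is stubbed out as a simple enqueue:

```lua
-- A minimal per-thread message queue.  In the real system, messages
-- arrive via PostThreadMessage and the thread's message loop enqueues
-- them; here the queue is just a table acting as a FIFO.
local MessageQueue = {}
MessageQueue.__index = MessageQueue

function MessageQueue.new()
    return setmetatable({ items = {}, first = 1, last = 0 }, MessageQueue)
end

function MessageQueue:post(msg)          -- stand-in for PostThreadMessage
    self.last = self.last + 1
    self.items[self.last] = msg
end

function MessageQueue:take()             -- returns nil when empty
    if self.first > self.last then return nil end
    local msg = self.items[self.first]
    self.items[self.first] = nil
    self.first = self.first + 1
    return msg
end

-- The executing code runs in a coroutine, pulling messages at its leisure.
local function worker(queue)
    return coroutine.wrap(function()
        while true do
            coroutine.yield(queue:take())
        end
    end)
end

local q = MessageQueue.new()
q:post("hello")
q:post("world")
local next_msg = worker(q)
print(next_msg())  --> hello
print(next_msg())  --> world
```

The nice property here is that the worker never blocks on shared state; it only ever sees whole messages in order of arrival.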

So, there it is.  First steps towards making a highly scalable Lua based processing engine.


Performance, structures matter…

Performance tuning is always fun.  The very best programmers in the world know how each and every choice they make will have an effect based on their machine architecture, runtime, compiler, etc.  I guess I’m not one of those types of programmers, because I’ve long since lost track of how what I code affects the lowest-level microcode.  But, I occasionally make crude attempts to speed things up at a much higher level.

One area I focused on recently was my basic frame buffer.  I had originally coded this as a single array of bytes, or whatever data structure was specified.  I had this convenient FixedArray2D class which made allocating them easy, by wrapping up some ffi goop to get at native data structures.  The downside of this class was that it had “SetElement” and “GetElement”.  Each of those calls did a calculation to determine where in the vast byte array the particular element was.  OK, that’s a no-brainer.  Mostly that calculation is going to be very quick, and maybe even inlined, etc.  But, how about just doing away with it and letting the compiler deal with it, just like in C.

So, I ditched it.  Now, I have some other convenience functions that do the job much more nicely and succinctly.

Array2D = function(columns, rows, kind)
    return ffi.new(string.format("%s[%d][%d]", kind, rows, columns))
end

Which can be used thus:

local window = Array2D(captureWidth, captureHeight,  "pixel_BGRA_b")
local graphPort = Array2DRenderer.Create(captureWidth, captureHeight, window, "pixel_BGRA_b")

In this case, I’m creating a pixel buffer which is a certain size, and filled with a BGRA pixel type.  Then I construct a renderer on top of that, and move on with life.  This works out quite nicely and gives you easy access to the data using normal 2 dimensional matrix access:

window[row][column] = value

Not only is this much more convenient, it turns out to be much faster as well.  I guess the runtime is taking care of the appropriate calculations, inlining, turning it into machine code, and done.  No more function calls in the way or anything.  Just plain fast code.  I’ve even changed the renderer to use regular array assignments.  For example, with LineH (draw a horizontal line), I used to do a memcpy essentially.  Now, I figure, iterating over the locations, making assignments, might be just as fast as the memcpy, so I just do that, and let the compiler figure out how to optimize it, as this should be easily optimizable.
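A sketch of what that LineH looks like as a plain assignment loop.  This is a guess at the shape of the routine, not the actual Array2DRenderer code:

```lua
local ffi = require "ffi"

-- Draw a horizontal line with plain element assignments, and let the
-- JIT compiler turn the loop into tight machine code, rather than
-- calling out to a memcpy.
local function LineH(buffer, x, y, length, value)
    for i = x, x + length - 1 do
        buffer[y][i] = value
    end
end

-- An 8x8 grayscale buffer, built the same way Array2D builds one.
local buf = ffi.new("uint8_t[8][8]")
LineH(buf, 2, 3, 4, 255)
print(buf[3][2], buf[3][5], buf[3][6])  --> 255  255  0
```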

In the end, what did this get me?  Well, the graphics fill rate has gone up.  I can now draw thousands of tiny little triangles in realtime (30 fps), without much sweat to the system.  That same task was getting bogged down at around 2048 triangles using my previous structures.  So, this is an improvement.

Now I’m casting an eye towards matrix and vector speedups.  At the moment, I represent my matrix class as a 16 element array.  This is convenient for OpenGL, but it’s pretty inconvenient for virtually everything else.  Again, I have to litter the code with offsets and the like, and probably hurt my performance.  I’ll just switch this to being an Array2D of “double” and see where that gets me.  I could just use Lua tables, but I’m not sure if there’s a higher cost to Lua tables doing the appropriate lookups, or if there’s a higher cost converting types between lua numbers and the native types I’m storing in my structures (typically float).  We’ll see.  Since multiplying a vec3 by a mat4 is the hot path in graphics processing, making this path as fast as possible is a very desirable thing.
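For the hot path itself, here is a hedged sketch of the vec3-by-mat4 multiply using 2D indexing on an ffi array of doubles.  The mat4, vec3, and TransformPoint names are mine, not the actual matrix class:

```lua
local ffi = require "ffi"

-- A 4x4 matrix as a two-dimensional array of doubles, so the multiply
-- uses direct [row][column] indexing instead of hand-computed offsets
-- into a flat 16-element array.
local mat4 = ffi.typeof("double[4][4]")
local vec3 = ffi.typeof("double[3]")

local function TransformPoint(m, v)
    -- treat v as a point: implicit w = 1
    local x = m[0][0]*v[0] + m[0][1]*v[1] + m[0][2]*v[2] + m[0][3]
    local y = m[1][0]*v[0] + m[1][1]*v[1] + m[1][2]*v[2] + m[1][3]
    local z = m[2][0]*v[0] + m[2][1]*v[1] + m[2][2]*v[2] + m[2][3]
    return vec3(x, y, z)
end

-- identity with a translation of (10, 20, 30)
local m = mat4()
m[0][0], m[1][1], m[2][2], m[3][3] = 1, 1, 1, 1
m[0][3], m[1][3], m[2][3] = 10, 20, 30

local p = TransformPoint(m, vec3(1, 2, 3))
print(p[0], p[1], p[2])  --> 11  22  33
```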

In the meantime, fill rates have gone up.  This bodes well for other data structures, like the ones needed to compress bits of video screen before sending them off to the network.  Fast access, at a very low level will be highly beneficial there.

Model View Projection Perspective Screen…

There is an extremely long walk that a pixel has to take from the model to the screen.  In order to recreate the render pipeline, I have to take care of all those transforms all by myself.  Starting from the last one, there is a Viewport transform.  The viewport transform takes “normalized” device coordinates (-1,1 in x-axis and y-axis) and turns them into actual screen coordinates.  I have implemented a ViewportTransform object which performs this particular task.  It’s fairly straightforward:

local vpt0 = ViewportTransform(captureWidth, captureHeight)

Then, to transform a point:

local v11 = vpt0:Transform(vec3(-.75, .25, 0))

Since these points have to be normalized, they have to be specified in values between -1 and 1 on all axes.  What will be returned is a vec2, which will be in the range of captureWidth, captureHeight.  That’s real nice.  You can easily deal with top down, or bottom up, just by changing the sign of the captureHeight.
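A minimal sketch of what that transform does, assuming the obvious NDC-to-pixel mapping.  This is my reading of the behavior, not the actual ViewportTransform implementation:

```lua
-- Map normalized device coordinates (-1..1 on each axis) to pixel
-- coordinates inside a width x height rectangle at an (x, y) offset.
-- Vectors here are plain tables for illustration, not the ffi vec types.
local function ViewportTransform(width, height, xoffset, yoffset)
    xoffset = xoffset or 0
    yoffset = yoffset or 0
    local vpt = {}
    function vpt:Transform(v)
        -- v[1], v[2] are the normalized x and y of a vec3
        local sx = xoffset + (v[1] + 1) * 0.5 * width
        local sy = yoffset + (v[2] + 1) * 0.5 * height
        return { sx, sy }   -- a vec2 in screen space
    end
    return vpt
end

local vpt0 = ViewportTransform(640, 480)
local p = vpt0:Transform({ -0.75, 0.25, 0 })
print(p[1], p[2])  --> 80  300
```

Handing the constructor a nonzero offset is what places output in a particular quadrant of the screen.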

In the above picture, I actually setup 4 different viewport transforms:

local vpt1 = ViewportTransform(captureWidth/2, captureHeight/2, 0,0)
local vpt2 = ViewportTransform(captureWidth/2, captureHeight/2, captureWidth/2, 0)
local vpt3 = ViewportTransform(captureWidth/2, captureHeight/2, captureWidth/2, captureHeight/2)
local vpt4 = ViewportTransform(captureWidth/2, captureHeight/2, 0, captureHeight/2)

Then, when it comes time to render, I just run the vertices through each of the transforms, and I will receive coordinates that are placed in the appropriate quadrant.  That’s just how it works in OpenGL as well, although there is the advantage of a hardware assist in that case.

This is a good thing because it means I can draw different views of the same thing simply by changing the viewport.  That’s what you see in those high-end CAD systems where they show different perspectives on the model while you’re working on it.

Similarly, it allows you to fairly easily parcel up the screen, and do things like draw windows, or separate between the 3D Rendering, and the 2D UI portion of the screen.

But, this particular transform works with those normalized values.  In order to get from the model to these normalized values, there are a couple more transforms.  One is to transform from the model’s view of the world to the ‘camera’s view of the world.  That includes a modelview transform, projection transform and a perspective division.  Not too bad.  Once those are in place, there is a complete 2D rendering pipeline, and triangles can show up again.

Calculating Colorful Pictures

Here is a Mandelbrot set coded up using the BanateCore (SetPixel).  No particular color palette, so just raw values from 0–255, turned into grayscale.

In the earliest days of my programming on a Commodore PET, I had fun doing stuff with graphics.  At that time, one of the biggest challenges I had was doing a flood fill routine in 6502 assembly.  We’re a far cry, and quite a few gigahertz beyond the speed of that early machine, but I find myself once again staring at images that were fun to create about 30 years ago.

Nowadays, with the BanateCore, it’s fairly straight forward to simply think of my graphics environment as not much more sophisticated than a VGA frame buffer circa 1980 or so.  You know, select a mode, then setpixel, and you’re done.  So, here’s another machination using those very basics.  A rendering of the mandelbrot set.

Way back in another day, I had the benefit of working with Benoit Schillings while at Be Inc.  Benoit is quite an extraordinary programmer.  Way back then, on a lowly BeBox, I saw him code up a mandelbrot viewer that allowed you to navigate and zoom around in realtime.  Mind you, this was about 15 years ago, with a lot less horse power than we have today.  What I’ve produced here is nothing as slick as what we had then, but it is the basics.  My goal is to be able to tweak the speed until it is satisfactory for realtime navigation.

From what I’ve seen over the years, one of the key challenges in Mandelbrot presentation is selecting the proper color scheme.  In this case, I simply used a frequency converter.  That is, 0 == lowest frequency, 255 == highest frequency.  Turn that into an RGB value, and that’s that.
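The raw values themselves come from the standard escape-time iteration.  Here is a minimal pure-Lua sketch of the per-point loop (the real version feeds the result to SetPixel):

```lua
-- Escape-time iteration for one point c = cr + ci*i.  Returns 0..255,
-- where 255 means the point never escaped within the iteration budget;
-- that raw value is what gets turned into a grayscale pixel.
local function mandelbrot(cr, ci, maxiter)
    maxiter = maxiter or 255
    local zr, zi = 0, 0
    for n = 0, maxiter - 1 do
        local zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4 then
            return n           -- escaped after n iterations
        end
        zr, zi = zr2 - zi2 + cr, 2 * zr * zi + ci
    end
    return maxiter
end

print(mandelbrot(0, 0))   --> 255 (inside the set)
print(mandelbrot(2, 2))   --> 1   (escapes immediately)
```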

Same coloring scheme, except in this case it just so happens to look much more interesting because of this particular region of the fractal.

What has been fun about this little experiment is that it’s just one more case where I’m able to wring some more performance out of the system, and simplify the API design that much more.  Little applets like this can be created in roughly 200 lines of code.  My goal is to remove about 50 lines of code from that equation, so that rendering such simple things is so brain dead simple that anyone could do it.

Kinect to Lua

Good News everyone!  At least that’s how one of my favorite electronics supply shops AndaHammer.com starts every post.   I finally got the Kinect to spit out some color information, and I can consume it using LuaJIT!

What a long and arduous road it has been!  One of the things I can say from my experience is, if you’re going to use the Microsoft SDK 1.0 for Kinect, first uninstall all other drivers you might have been fooling around with, and reboot.  This caught me for the longest time.  I was getting sporadic behavior, errors with invalid pointers, and all manner of frustration.

After Uninstalling the CLNui stuff, and rebooting, the proper device driver for the Kinect installed itself, and things improved.

The next challenge had to do with the format of the data coming off the Kinect.  The color data is a somewhat funky format.  It is RGB-32.  Not RGB-24 (8 bits per component, 24 bits per pixel).  RGB-32 is 8 bits per component, but it is actually 32 bits per pixel.  The remaining 8 bits are ignored.  They are not alpha.  If they were, the value would be set to 255.  But, since they are “ignored”, the final 8-bits are set to ‘0’.  Well, that’s a bit of a bother.  In my ideal world, I’d simply represent the data as a RGBA pixel, and be done with it.  But, I can’t.  What I need is a RGB-32, which will have an OpenGL signature of RGB.  Well, not the hardest thing in the world, but it will take some small changes to my frameworks to make it happen.
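To make the layout concrete, here is one pixel unpacked with plain arithmetic, assuming the byte order suggested by the pixel_BGRA_b accessor (B, G, R, then the ignored byte):

```lua
-- Unpack one RGB-32 pixel.  The top byte is "ignored" -- it reads back
-- as 0, not 255, which is exactly why it cannot be treated as alpha.
local function unpack_rgb32(pixel)
    local b = pixel % 256
    local g = math.floor(pixel / 256) % 256
    local r = math.floor(pixel / 65536) % 256
    local x = math.floor(pixel / 16777216) % 256  -- the ignored byte
    return r, g, b, x
end

-- a pure red pixel: R=255, G=0, B=0, ignored byte 0
local r, g, b, x = unpack_rgb32(255 * 65536)
print(r, g, b, x)  --> 255  0  0  0
```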

In the meanwhile, I just change my GLTexture object to have an internal format of RGB (ignoring alpha), and that fixed the problem.

What you see in the picture is an upside down representation of the view out my apartment window.  Yep, another one of those little things.  The picture is bottom up, or top down, depending on your perspective.  It’s probably top down coming off the camera, and my OpenGL view of it is flipping it.  But, the data is there.

A little bit about the code.  First of all, it’s all in GitHub.  You can find the Nui low level interop stuff in the BanateCoreWin32 repository under the Kinect directory.  In the directory, there is the Kinect.lua file, which is the simplified wrapper on top of the core Nui calls.

How to use it?  First you initialize a sensor, based on an index value:

require "Kinect"
local sensor0 = Kinect.GetSensorByIndex(0, NUI_INITIALIZE_FLAG_USES_COLOR)

In this case, I am initializing the first sensor in my system (you can have multiples), and telling it I’ll only be grabbing color information, no skeleton tracking or depth information.

Then, when I want to grab a frame and copy bits, I do this:

local success = sensor0:GetCurrentColorFrame()
local colorAccessor = Array2DAccessor({
    TypeName = "Ppixel_BGRA_b",
    Width = captureWidth,
    Height = captureHeight,
    Data = sensor0.LockedRect.pBits,
    BytesPerElement = 4,
    Alignment = 1,
})

-- ... copy the pixels out through the accessor ...

sensor0:ReleaseCurrentColorFrame()

At the moment, you MUST make the call to ReleaseCurrentColorFrame, or you’ll find you’re not getting any frames soon enough.

This is the “polling” style of getting frames from the device.  The preferred way is to actually use an eventing model where the device will alert you when frames are available.  My first run through though, I just wanted to get at the pixels as soon as possible, so I polled.

This device is very finicky.  It doesn’t like to be polled too much, and there are tons of error cases that you need to be aware of and deal with properly.  At this point though, I know I am talking to it successfully.  I even managed to make the tilt go up and down, which is nice.

Now that I have a code path that is basically working, I can move on to making it much nicer to use.  For example, it would be nice to represent the camera as an iterator, where I can just essentially say GetNext(), and get the next instance of the color frame, without having to worry about releasing things.
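That iterator might take a shape like this.  Everything here is a sketch with a stubbed sensor, not working Kinect code:

```lua
-- Wrap the acquire/release pair in an iterator so callers never have
-- to remember ReleaseCurrentColorFrame themselves.  In real code the
-- frame bits would be copied out before the release.
local function ColorFrames(sensor, count)
    local n = 0
    return function()
        n = n + 1
        if count and n > count then return nil end
        if not sensor:GetCurrentColorFrame() then return nil end
        local frame = sensor.LockedRect          -- copy in real code
        sensor:ReleaseCurrentColorFrame()        -- handled for the caller
        return frame
    end
end

-- stub sensor for illustration; a real one comes from Kinect.GetSensorByIndex
local stub = {
    LockedRect = { pBits = "pixels" },
    GetCurrentColorFrame = function(self) return true end,
    ReleaseCurrentColorFrame = function(self) end,
}

for frame in ColorFrames(stub, 3) do
    print(frame.pBits)   --> pixels (three times)
end
```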

At any rate, there it is, ready for other people to hack on and improve.

Walking the roads of madness

Well, it started innocently enough.  I wanted to add some very fast core networking interop to my code.  So, I started with the Winsock library on Windows because I want to use I/O completion ports.  Alright, fair enough, and an easy job was made of doing basic Berkeley style network socket calls.  Then along came some nefarious data structures, notably, WSAPROTOCOL_INFO.  Buried within that data structure is the innocent GUID.

Well, of course implementing GUID didn’t lead directly down the path of COM, but it was so innocently close I could not resist.  So, down the COM rabbit hole I went.

COM is a very ancient technology.  It was invented in the early days of Microsoft wanting to embed components such as pictures, spreadsheets and the like into Word documents.  It was such a great idea, that over the past 20 years or so, the world has had the benefit of living the glory of a COM existence through Windows.

COM is many things.  It is a type system, a runtime, a programming way of life.  There are tons of interfaces, functions, and the like.  There’s stuff for “server” side, and “client” side, proxies, asynchronous stuff.  The list goes on and on.  There are type libraries, which sometimes lie, and AddRef/Release for “garbage collection”.  There are BSTRs, Variants, SafeArray, and a whole host of other data types, conversions between the data types, and on and on it goes.

If all you want is access to your Kinect though, what to do?

You can start off easily enough:


sensor = CreateObject(IID_INuiSensor);

Once you have a handle on the sensor, you can make some calls.

imageStream = sensor.NuiImageStreamOpen(...)

sensor.NuiImageStreamGetNextFrame(imageStream, 30, imageFrame)

Then go and do something with the frame.  This isn’t exactly the code you would write, but it’s pretty close to what will be there eventually.

To make stuff work in COM, there’s a lot of work that has to be done with interfaces.  It’s just like any other code, you have to define the various structures and method definitions.  In some cases, you might have what’s known as a type library (.tlb) file, which is a binary encapsulation of the interface to the objects.  In most cases though, all you have is a header file with all the GUID and interface goop laid out in barely readable form.

For the first few cases, I am hand coding the interfaces from header files.  A better approach long term will be to use the MIDL compiler to create a .tlb file, then read the .tlb file to get all the type and method information.  Once I can do that, I can construct the definitions from their source and ditch the hand transcription.

There is another possibility that occurs to me though.  Rather than constructing a bunch of code that mimics all the interfaces known to man, since Lua is much more dynamic, interfaces can be constructed on the fly.  If I have the .tlb files for something, then I can read it at runtime, and using the __call() metamethod, just call the appropriate method when the user makes the call.  Basically, I could construct the appropriate ffi.cdef[[]] construct, then make the call, as long as the library has already been loaded into memory.
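A sketch of that idea, using __index for the lazy lookup (a close cousin of the __call approach mentioned above), with the .tlb reading faked as a plain table.  Nothing here parses an actual type library:

```lua
-- Build an interface at runtime: method descriptions that would come
-- from a .tlb file are looked up lazily via __index, and a callable
-- stub is constructed and cached on first use.  In the real version
-- the stub would ffi.cdef the signature and call through the COM vtable.
local function MakeInterface(typeinfo, impl)
    return setmetatable({}, {
        __index = function(self, name)
            assert(typeinfo[name], "method not in type library: " .. name)
            local fn = function(_, ...)
                -- here: construct the ffi call from typeinfo[name]
                return impl[name](...)
            end
            rawset(self, name, fn)   -- cache for the next call
            return fn
        end,
    })
end

-- fake "type library" entry and a fake implementation for illustration
local typeinfo = { NuiCameraElevationGetAngle = "long(*)(long*)" }
local impl = { NuiCameraElevationGetAngle = function() return 0, 10 end }

local sensor = MakeInterface(typeinfo, impl)
print(sensor:NuiCameraElevationGetAngle())  --> 0  10
```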

If that technique works, it would be the bees knees.  No need to generate large amounts of code, just ship the .tlb files, and leave it at that.

At any rate, dealing with “legacy” coding paradigms can be a real pain some times.  The benefit is getting access to a bunch of stuff, which can then be incorporated with new code and paradigms.


The March on COM Continues

Motoring along with a COM interop module…

After GUID/UUID, the next thing to deal with is BSTR.  I’m not much of a COM programmer, but there are two things over the years that I’ve always wanted to stay at arms length from.  The first is the BSTR, and the second is the VARIANT.  These things were possibly greatly useful in their day, but these days, they’re just kind of anachronistic.

At any rate, BSTRs are a beastly must, because in the Kinect API, there are BSTRs.  It seems a bit odd to me that given the way the world works today with dynamic languages, we could not come up with easier interfaces, that don’t require the support of a full framework such as COM to do something as simple as “GetNextFrame”.  But, there you have it.

I have managed to do the code wrangling to get the Nuixxx interfaces from the Microsoft Kinect SDK into “compilable” shape.  That is, I looked at the SDK headers, and transformed them bit by bit into what I thought was correct form for interop with LuaJIT.  I haven’t actually tested them as yet, because the rest of the COM support isn’t quite fixed.  But, soon enough, I might be able to grab frames from the Kinect.

At any rate, the BanateCoreWin32 codebase has taken on quite a lot of new APIs and whatnot.  If you’re interested in looking at such things, you might head on over to the GitHub and browse around.

Along the way, I actually ran into a stumper, and none other than Mike Pall gave me some hints and tips to make things better.  The primary thing I had to deal with was forward declarations of structures.  A common construct such as:

typedef struct AStructure AStructure;

typedef struct otherstruct {
    AStructure *food;
} otherstruct;

struct AStructure {
    otherstruct *areference;
};
The Microsoft COM headers are full of stuff like this, except they use the word “interface” instead of ‘struct’.  In most cases, I could just change the word to ‘struct’, and the rest is fine, and things work out.

While he was at it, Mike pointed out that the construct that I was using:

floatv = ffi.typeof("float[?]")

wasn’t the most efficient thing in the world.  Although it’s convenient when you’re doing more interesting structures than basic types, it’s not particularly performant when dealing with base types such as numbers.  Instead, it’s better to just do:

vec3 = ffi.typeof("float[3]")
vec4 = ffi.typeof("float[4]")

You still can’t assign a metatable, as far as I know, but you get the easy constructor semantics:

vec3(x, y, z)

So, things are humming along.  Having spent a day in the bowels of COM interop, I feel even more strongly that such interfaces need to be left far behind.  There is too much time and energy spent on trying to massage the right things to happen.  In comparison, doing interop to straight C interfaces is almost mind numbingly simple.  Of course, it was great for the time, and solved a lot of problems, but I wonder what would happen if interfaces today were defined in something like Lua, instead of IDL and COM, etc.