Loading Multiple Lua States

Here’s the thing.  Ultimately I want to use I/O completion ports on Windows to create highly scalable networking infrastructure.  I don’t want to write a lick of C code, I want to stay completely within the confines of Lua.  So what to do?  Well, first off, when you work with IOCP, you need to assign some threads to deal with the completion of various IO operations.  OK.  Easy enough, with LuaJIT, just create the interop function to the CreateThread() system call.  But wait, I want to execute some lua code in that thread…

Alrighty then, why don’t I just pass in the lua state that I have now and…

Never mind.  What I really need to do is create a thread, and within that thread create a lua state object, and have that state execute the little bit of code that is needed.  With LuaJIT, this is totally possible.  Normally, when you create a lua_State, you are writing code in standard ‘C’, or whatever environment.  But with LuaJIT, the lua51.dll is just as accessible as any other library in the world, so you can simply access it and create your state as you normally would in C.

But there’s a rub: first you need that massive ffi definitions file that mimics the appropriate .h files.  And so, first I had to create Luaffi.lua, which is part of the BanateCoreWin32 files.  What is this file?  It is basically an amalgamation of the various header files that are used to create Lua.  Namely, it includes the contents of: luaconf.h, lua.h, lauxlib.h, and lualib.h.

This was a fairly mindless task of copying over the appropriate part, deciding whether a #define was a simple alias to something, a function, or a constant, and putting ffi.cdef[[]] around all the appropriate functions.  It compiles cleanly, but that does not mean everything actually works correctly.  I’ll have to go through it a few times to ensure everything is actually correct.
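To give a flavor of that translation, here is an illustrative sketch (not the actual contents of Luaffi.lua) of how a #define and a few declarations from lua.h come across:

```lua
local ffi = require "ffi"

-- A #define that is a simple constant becomes a plain Lua value:
LUA_MULTRET = -1

-- Function declarations from lua.h go inside an ffi.cdef block:
ffi.cdef[[
typedef struct lua_State lua_State;

lua_State *luaL_newstate(void);
void       lua_close(lua_State *L);
int        luaL_loadstring(lua_State *L, const char *s);
]]
```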

There is one big warning in there.  There is a constant that comes from stdio: BUFSIZ

On Windows, this is defined as 512, so I explicitly set the value to 512.  This value is completely dependent on your system, and should be set appropriately.  Ideally, this would be a value that could be queried in the system, because this is the most fragile bit of this interop.  It’s not used in too many places, but when it is, it will likely break things.

And so, what do you get for your troubles?

local ffi = require "ffi"
local lua = require "Luaffi"

function report_errors(L, status)
    if status ~= 0 then
        print("-- ", ffi.string(lua_tostring(L, -1)))
        lua_pop(L, 1)  -- remove error message
    end
end

function execlua(codechunk)
    local L = lua.luaL_newstate()  -- create state

    if L == nil then
        print("cannot create state: not enough memory")
        return
    end

    -- Load whatever libraries are necessary for
    -- your code to start
    print("luaopen_base: ", lua.luaopen_base(L))
    print("luaopen_io: ", lua.luaopen_io(L))
    print("luaopen_table: ", lua.luaopen_table(L))
    print("luaopen_string: ", lua.luaopen_string(L))
    print("luaopen_math: ", lua.luaopen_math(L))
    print("luaopen_bit: ", lua.luaopen_bit(L))
    print("luaopen_jit: ", lua.luaopen_jit(L))
    print("luaopen_ffi: ", lua.luaopen_ffi(L))

    -- execute the given code chunk
    local status = luaL_dostring(L, codechunk)
    print("Result: ", status)

    report_errors(L, status)

    lua.lua_close(L)  -- tear down the state
end

execlua("print('hello lua')")

That’s a fairly standard looking “main()” for using Lua.  Basically, do everything within LuaJIT itself, without having to use a lick of C code.  Now that I have this basic capability, the rest of the task is fairly straightforward.

One challenge ahead is how to communicate between threads.  Well, I’m a big believer in message passing.  From years of doing multi-threaded, multi-processor programming, I know full well that I’m not good at maintaining shared memory state.  Over the years, I have found that the best way to keep things straight is to simply pass messages.  Granted, debugging an asynchronous message passing system is no walk in the park either, but it makes for much more easily scalable systems.  By using message passing, you can focus on ensuring that the messaging mechanism between processes works correctly, and forget the rest.  This style also lends itself easily to being distributed, either across processes, or across the internet, which is a good thing.

One way to pass messages between threads on Windows is to use the PostThreadMessage() function.  Each thread will be running in a little message loop, and when a message comes in, it can be placed on a queue within the lua state, and the executing code (which should be in a coroutine) can pull it out and deal with it at its leisure.

Of course, to extend more broadly, PostThreadMessage can be aliased with something more interesting, like PostIPNode, or whatever.  As long as the function can take a Blob and send it to its destination, I shouldn’t really care.
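A sketch of what that per-thread message loop might look like through the LuaJIT ffi (the cdefs are hand-abbreviated from winuser.h; the WM_BLOB message id and the queue are my own inventions, not part of any existing code):

```lua
local ffi = require "ffi"

ffi.cdef[[
typedef struct tagMSG {
    void          *hwnd;
    unsigned int   message;
    uintptr_t      wParam;
    intptr_t       lParam;
    unsigned long  time;
    long           pt_x, pt_y;
} MSG;

int GetMessageA(MSG *lpMsg, void *hWnd, unsigned int wMsgFilterMin, unsigned int wMsgFilterMax);
int PostThreadMessageA(unsigned long idThread, unsigned int Msg, uintptr_t wParam, intptr_t lParam);
]]

local WM_APP  = 0x8000
local WM_BLOB = WM_APP + 1   -- hypothetical "here is a blob" message id

-- Runs inside the worker thread's own lua state: pull messages off
-- the thread's queue and park them for a coroutine to consume later.
local function messagePump(queue)
    local msg = ffi.new("MSG")
    while ffi.C.GetMessageA(msg, nil, 0, 0) > 0 do
        if msg.message == WM_BLOB then
            queue[#queue + 1] = msg.wParam   -- wParam carries the blob pointer
        end
    end
end
```

GetMessageA and PostThreadMessageA resolve through ffi.C because user32.dll is part of LuaJIT’s default namespace on Windows.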

So, there it is.  First steps towards making a highly scalable Lua based processing engine.

Performance, structures matter…

Performance tuning is always fun.  The very best programmers in the world know how each and every choice they make will have an effect based on their machine architecture, runtime, compiler, etc.  I guess I’m not one of those types of programmers, because I’ve long since lost track of how what I code affects the lowest level microcode.  But I occasionally make crude attempts to speed things up at a much higher level.

One area I focused on recently was my basic frame buffer.  I had originally coded this as a single array of bytes, or whatever data structure was specified.  I had this convenient FixedArray2D class which made allocating them easy, by wrapping up some ffi goop to get at native data structures.  The downside of this class was that it had “SetElement” and “GetElement”.  Each of those calls did a calculation to determine where in the vast byte array the particular element was.  OK, that’s a no-brainer.  Mostly that calculation is going to be very quick, and maybe even inlined, etc.  But, how about just doing away with it and letting the compiler deal with it, just like in C.

So, I ditched it.  Now, I have some other convenience functions that do the job much more nicely and succinctly.

Array2D = function(columns, rows, kind)
    return ffi.new(string.format("%s[%d][%d]", kind, rows, columns))
end
Which can be used thus:

local window = Array2D(captureWidth, captureHeight,  "pixel_BGRA_b")
local graphPort = Array2DRenderer.Create(captureWidth, captureHeight, window, "pixel_BGRA_b")

In this case, I’m creating a pixel buffer which is a certain size, and filled with a BGRA pixel type.  Then I construct a renderer on top of that, and move on with life.  This works out quite nicely and gives you easy access to the data using normal 2 dimensional matrix access:

window[row][column] = value

Not only is this much more convenient, it turns out to be much faster as well.  I guess the runtime is taking care of the appropriate calculations, inlining, turning it into machine code, and done.  No more function calls in the way or anything.  Just plain fast code.  I’ve even changed the renderer to use regular array assignments.  For example, with LineH (draw a horizontal line), I used to do a memcpy essentially.  Now, I figure, iterating over the locations, making assignments, might be just as fast as the memcpy, so I just do that, and let the compiler figure out how to optimize it, as this should be easily optimizable.
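For illustration, the horizontal-line routine reduces to plain assignments (this is my paraphrase of the idea, not the actual Array2DRenderer code):

```lua
-- Draw a horizontal run of 'length' pixels starting at (x, y) by
-- straight element assignment; the JIT turns this into a tight loop.
local function LineH(buffer, x, y, length, value)
    for col = x, x + length - 1 do
        buffer[y][col] = value
    end
end
```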

In the end, what did this get me?  Well, the graphics fill rate has gone up.  I can now draw thousands of tiny little triangles in realtime (30 fps), without much sweat to the system.  That same task was getting bogged down at around 2048 triangles using my previous structures.  So, this is an improvement.

Now I’m casting an eye towards matrix and vector speedups.  At the moment, I represent my matrix class as a 16 element array.  This is convenient for OpenGL, but it’s pretty inconvenient for virtually everything else.  Again, I have to litter the code with offsets and the like, and probably hurt my performance.  I’ll just switch this to being an Array2D of “double” and see where that gets me.  I could just use Lua tables, but I’m not sure if there’s a higher cost to Lua tables doing the appropriate lookups, or if there’s a higher cost converting types between lua numbers and the native types I’m storing in my structures (typically float).  We’ll see.  Since multiplying a vec3 by a mat4 is the hot path in graphics processing, making this path as fast as possible is a very desirable thing.
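For reference, that hot path is small enough to write out longhand.  A sketch against the current flat 16-element matrix (row-major layout and an implied w of 1 are my assumptions, not something the code above pins down):

```lua
local ffi = require "ffi"
local doublev = ffi.typeof("double[?]")

-- Transform a vec3 by a flat 4x4 matrix (row-major, w assumed 1).
local function mat4TransformVec3(m, v)
    return doublev(3,
        m[0]*v[0] + m[1]*v[1] + m[2]*v[2]  + m[3],
        m[4]*v[0] + m[5]*v[1] + m[6]*v[2]  + m[7],
        m[8]*v[0] + m[9]*v[1] + m[10]*v[2] + m[11])
end
```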

In the meantime, fill rates have gone up.  This bodes well for other data structures, like the ones needed to compress bits of video screen before sending them off to the network.  Fast access, at a very low level will be highly beneficial there.

Model View Projection Perspective Screen…

There is an extremely long walk that a pixel has to take from the model to the screen.  In order to recreate the render pipeline, I have to take care of all those transforms all by myself.  Starting from the last one, there is a Viewport transform.  The viewport transform takes “normalized” device coordinates (-1,1 in x-axis and y-axis) and turns them into actual screen coordinates.  I have implemented a ViewportTransform object which performs this particular task.  It’s fairly straight forward:

local vpt0 = ViewportTransform(captureWidth, captureHeight)

Then, to transform a point:

local v11 = vpt0:Transform(vec3(-.75, .25, 0))

Since these points have to be normalized, they have to be specified in values between -1, and 1 on all axes.  What will be returned is a vec2, which will be in the range of captureWidth, captureHeight.  That’s real nice.  You can easily deal with top down, or bottom up, just by changing the sign of the captureHeight.
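The transform itself is just a scale and an offset; here is a minimal sketch of the math (my paraphrase, not the actual ViewportTransform code):

```lua
-- Map normalized device coordinates (-1..1 on each axis) into
-- screen space; originX/originY shift the result into a quadrant.
-- Pass a negative height to flip between top-down and bottom-up.
local function viewportTransform(width, height, originX, originY, x, y)
    originX = originX or 0
    originY = originY or 0
    local sx = originX + (x + 1) * 0.5 * width
    local sy = originY + (y + 1) * 0.5 * height
    return sx, sy
end
```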

In the above picture, I actually set up 4 different viewport transforms:

local vpt1 = ViewportTransform(captureWidth/2, captureHeight/2, 0,0)
local vpt2 = ViewportTransform(captureWidth/2, captureHeight/2, captureWidth/2, 0)
local vpt3 = ViewportTransform(captureWidth/2, captureHeight/2, captureWidth/2, captureHeight/2)
local vpt4 = ViewportTransform(captureWidth/2, captureHeight/2, 0, captureHeight/2)

Then, when it comes time to render, I just run the vertices through each of the transforms, and I will receive coordinates that are placed in the appropriate quadrant.  That’s just how it works in OpenGL as well, although there is the advantage of a hardware assist in that case.

This is a good thing because it means I can draw different views of the same thing simply by changing the viewport.  That’s what you see in those high-end CAD systems where they show different perspectives on the model while you’re working on it.

Similarly, it allows you to fairly easily parcel up the screen, and do things like draw windows, or separate between the 3D Rendering, and the 2D UI portion of the screen.

But, this particular transform works with those normalized values.  In order to get from the model to these normalized values, there are a couple more transforms.  One is to transform from the model’s view of the world to the ‘camera’s view of the world.  That includes a modelview transform, projection transform and a perspective division.  Not too bad.  Once those are in place, there is a complete 2D rendering pipeline, and triangles can show up again.

Calculating Colorful Pictures

Here is a Mandelbrot set coded up using the BanateCore (SetPixel).  No particular color palette, so just raw values from 0 – 255, turned into grayscale.

In the earliest days of my programming on a Commodore PET, I had fun doing stuff with graphics.  At that time, one of the biggest challenges I had was doing a flood fill routine in 6502 assembly.  We’re a far cry, and quite a few gigahertz, beyond the speed of that early machine, but I find myself once again staring at images that were fun to create about 30 years ago.

Nowadays, with the BanateCore, it’s fairly straight forward to simply think of my graphics environment as not much more sophisticated than a VGA frame buffer circa 1980 or so.  You know, select a mode, then setpixel, and you’re done.  So, here’s another machination using those very basics.  A rendering of the mandelbrot set.

Way back in another day, I had the benefit of working with Benoit Schillings while at Be Inc.  Benoit is quite an extraordinary programmer.  Way back then, on a lowly BeBox, I saw him code up a mandelbrot viewer that allowed you to navigate and zoom around in realtime.  Mind you, this was about 15 years ago, with a lot less horse power than we have today.  What I’ve produced here is nothing as slick as what we had then, but it is the basics.  My goal is to be able to tweak the speed until it is satisfactory for realtime navigation.

From what I’ve seen over the years, one of the key challenges in Mandelbrot presentation is selecting the proper color scheme.  In this case, I simply used a frequency converter.  That is, 0 == lowest frequency, 255 == highest frequency.  Turn that into an RGB value, and that’s that.

Same coloring scheme, except in this case it just so happens to look much more interesting because of this particular region of the fractal.

What has been fun about this little experiment is that it’s just one more case where I’m able to wring some more performance out of the system, and simplify the API design that much more.  Little applets like this can be created in roughly 200 lines of code.  My goal is to remove about 50 lines of code from that equation, so that rendering such simple things is so brain dead simple that anyone could do it.

Kinect to Lua

Good News everyone!  At least that’s how one of my favorite electronics supply shops AndaHammer.com starts every post.   I finally got the Kinect to spit out some color information, and I can consume it using LuaJIT!

What a long and arduous road it has been!  One of the things I can say from my experience is, if you’re going to use the Microsoft SDK 1.0 for Kinect, first uninstall all other drivers you might have been fooling around with, and reboot.  This caught me for the longest time.  I was getting sporadic behavior, errors with invalid pointers, and all manner of frustration.

After Uninstalling the CLNui stuff, and rebooting, the proper device driver for the Kinect installed itself, and things improved.

The next challenge had to do with the format of the data coming off the Kinect.  The color data is a somewhat funky format.  It is RGB-32.  Not RGB-24 (8 bits per component, 24 bits per pixel).  RGB-32 is 8 bits per component, but it is actually 32 bits per pixel.  The remaining 8 bits are ignored.  They are not alpha.  If they were, the value would be set to 255.  But, since they are “ignored”, the final 8 bits are set to ‘0’.  Well, that’s a bit of a bother.  In my ideal world, I’d simply represent the data as an RGBA pixel, and be done with it.  But, I can’t.  What I need is an RGB-32, which will have an OpenGL signature of RGB.  Well, not the hardest thing in the world, but it will take some small changes to my frameworks to make it happen.
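The distinction looks like this in ffi terms (the type names here are mine, not the framework’s):

```lua
local ffi = require "ffi"

ffi.cdef[[
/* 24 bits per pixel: three components, tightly packed */
typedef struct { uint8_t b, g, r; } pixel_RGB24;

/* 32 bits per pixel: same components plus 8 ignored bits, which the
   Kinect leaves at 0 -- so they cannot be treated as alpha */
typedef struct { uint8_t b, g, r, ignored; } pixel_RGB32;
]]

assert(ffi.sizeof("pixel_RGB24") == 3)
assert(ffi.sizeof("pixel_RGB32") == 4)
```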

In the meanwhile, I just change my GLTexture object to have an internal format of RGB (ignoring alpha), and that fixed the problem.

What you see in the picture is an upside down representation of the view out my apartment window.  Yep, another one of those little things.  The picture is bottom up, or top down, depending on your perspective.  It’s probably top down coming off the camera, and my OpenGL view of it is flipping it.  But, the data is there.

A little bit about the code.  First of all, it’s all in GitHub.  You can find the Nui low level interop stuff in the BanateCoreWin32 repository under the Kinect directory.  In the directory, there is the Kinect.lua file, which is the simplified wrapper on top of the core Nui calls.

How to use it?  First you initialize a sensor, based on an index value:

require "Kinect"
local sensor0 = Kinect.GetSensorByIndex(0, NUI_INITIALIZE_FLAG_USES_COLOR)

In this case, I am initializing the first sensor in my system (you can have multiples), and telling it I’ll only be grabbing color information, no skeleton tracking or depth information.

Then, when I want to grab a frame and copy bits, I do this:

local success = sensor0:GetCurrentColorFrame()
local colorAccessor = Array2DAccessor({
    TypeName = "Ppixel_BGRA_b",
    Width = captureWidth,
    Height = captureHeight,
    Data = sensor0.LockedRect.pBits,
    BytesPerElement = 4,
    Alignment = 1,
})

-- ... copy the pixels out through the accessor ...

sensor0:ReleaseCurrentColorFrame()

At the moment, you MUST make the call to ReleaseCurrentColorFrame, or you’ll soon find you’re not getting any more frames.

This is the “polling” style of getting frames from the device.  The preferred way is to actually use an eventing model where the device will alert you when frames are available.  My first run through though, I just wanted to get at the pixels as soon as possible, so I polled.

This device is very finicky.  It doesn’t like to be polled too much, and there are tons of error cases that you need to be aware of and deal with properly.  At this point though, I know I am talking to it successfully.  I even managed to make the tilt go up and down, which is nice.

Now that I have a code path that is basically working, I can move on to making it much nicer to use.  For example, it would be nice to represent the camera as an iterator, where I can just essentially say GetNext(), and get the next instance of the color frame, without having to worry about releasing things.
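That iterator might look something like this sketch (frames() is a hypothetical name; GetCurrentColorFrame and ReleaseCurrentColorFrame are the calls described above):

```lua
-- Hand out one color frame per call, releasing the previous frame
-- before grabbing the next, so callers never manage releases.
local function frames(sensor)
    local held = false
    return function()
        if held then
            sensor:ReleaseCurrentColorFrame()
            held = false
        end
        if sensor:GetCurrentColorFrame() then
            held = true
            return sensor.LockedRect.pBits
        end
    end
end

-- usage: for pBits in frames(sensor0) do ... end
```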

At any rate, there it is, ready for other people to hack on and improve.

Walking the roads of madness

Well, it started innocently enough.  I wanted to add some very fast core networking interop to my code.  So, I started with the Winsock library on Windows because I want to use I/O completion ports.  Alright, fair enough, and an easy job was made of doing basic Berkeley style network socket calls.  Then along came some nefarious data structures, notably, WSAPROTOCOL_INFO.  Buried within that data structure is the innocent GUID.

Well, of course implementing GUID didn’t lead directly down the path of COM, but it was so innocently close I could not resist.  So, down the COM rabbit hole I went.

COM is a very ancient technology.  It was invented in the early days of Microsoft wanting to embed components such as pictures, spreadsheets and the like into Word documents.  It was such a great idea, that over the past 20 years or so, the world has had the benefit of living the glory of a COM existence through Windows.

COM is many things.  It is a type system, a runtime, a programming way of life.  There are tons of interfaces, functions, and the like.  There’s stuff for “server” side, and “client” side, proxies, asynchronous stuff.  The list goes on and on.  There are type libraries, which sometimes lie, and AddRef/Release for “garbage collection”.  There are BSTRs, Variants, SafeArray, and a whole host of other data types, conversions between the data types, and on and on it goes.

If all you want is access to your Kinect though, what to do?

You can start off easily enough:


sensor = CreateObject(IID_INuiSensor);

Once you have a handle on the sensor, you can make some calls.

imageStream = sensor.NuiImageStreamOpen(...)

sensor.NuiImageStreamGetNextFrame(imageStream, 30, imageFrame)

Then go and do something with the frame.  This isn’t exactly the code you would write, but it’s pretty close to what will be there eventually.

To make stuff work in COM, there’s a lot of work that has to be done with interfaces.  It’s just like any other code, you have to define the various structures and method definitions.  In some cases, you might have what’s known as a type library (.tlb) file, which is a binary encapsulation of the interface to the objects.  In most cases though, all you have is a header file with all the GUID and interface goop laid out in barely readable form.

For the first few cases, I am hand coding the interfaces from header files.  A better approach long term will be to use the MIDL compiler to create a .tlb file, then read the .tlb file to get all the type and method information.  Once I can do that, I can construct the definitions from their source and ditch the hand transcription.

There is another possibility that occurs to me though.  Rather than constructing a bunch of code that mimics all the interfaces known to man, since Lua is much more dynamic, interfaces can be constructed on the fly.  If I have the .tlb files for something, then I can read it at runtime, and using the __call() metamethod, just call the appropriate method when the user makes the call.  Basically, I could construct the appropriate ffi.cdef[[]] construct, then make the call, as long as the library has already been loaded into memory.
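A sketch of the shape that dynamic dispatch could take (the typeinfo table standing in for a parsed .tlb is entirely hypothetical; only the metatable mechanics are real):

```lua
local ffi = require "ffi"

-- 'typeinfo' stands in for method signatures read out of a .tlb file
-- at runtime; the metatable turns lookups into on-the-fly ffi calls.
local function makeDynamicInterface(rawptr, typeinfo)
    return setmetatable({ ptr = rawptr }, {
        __index = function(self, methodName)
            local sig = typeinfo[methodName]   -- e.g. a C declaration string
            return function(self, ...)
                ffi.cdef(sig)                  -- declare the method on demand
                -- ...then cast self.ptr and call through the vtbl slot...
            end
        end,
    })
end
```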

If that technique works, it would be the bees knees.  No need to generate large amounts of code, just ship the .tlb files, and leave it at that.

At any rate, dealing with “legacy” coding paradigms can be a real pain some times.  The benefit is getting access to a bunch of stuff, which can then be incorporated with new code and paradigms.


The March on COM Continues

Motoring along with a COM interop module…

After GUID/UUID, the next thing to deal with is BSTR.  I’m not much of a COM programmer, but there are two things over the years that I’ve always wanted to stay at arm’s length from.  The first is the BSTR, and the second is the VARIANT.  These things were possibly greatly useful in their day, but these days, they’re just kind of anachronistic.

At any rate, BSTRs are a beastly must, because in the Kinect API, there are BSTRs.  It seems a bit odd to me that given the way the world works today with dynamic languages, we could not come up with easier interfaces, that don’t require the support of a full framework such as COM to do something as simple as “GetNextFrame”.  But, there you have it.

I have managed to do the code wrangling to get the Nuixxx interfaces from the Microsoft Kinect SDK into “compilable” shape.  That is, I looked at the SDK headers, and transformed them bit by bit into what I thought was correct form for interop with LuaJIT.  I haven’t actually tested them as yet, because the rest of the COM support isn’t quite fixed.  But, soon enough, I might be able to grab frames from the Kinect.

At any rate, the BanateCoreWin32 codebase has taken on quite a lot of new APIs and whatnot.  If you’re interested in looking at such things, you might head on over to the GitHub and browse around.

Along the way, I actually ran into a stumper, and none other than Mike Pall gave me some hints and tips to make things better.  The primary thing I had to deal with was forward declarations of structures.  A common construct such as:

typedef struct AStructure AStructure;

typedef struct otherstruct {
    AStructure *food;
} otherstruct;

struct AStructure {
    otherstruct *areference;
};
The Microsoft COM headers are full of stuff like this, except they use the word “interface” instead of ‘struct’.  In most cases, I could just change the word to ‘struct’, and the rest is fine, and things work out.

While he was at it, Mike pointed out that the construct that I was using:

floatv = ffi.typeof("float[?]")

wasn’t the most efficient thing in the world.  Although it’s convenient when you’re doing more interesting structures than basic types, it’s not particularly performant when dealing with base types such as numbers.  Instead, it’s better to just do:

vec3 = ffi.typeof("float[3]")
vec4 = ffi.typeof("float[4]")

You still can’t assign a metatable, as far as I know, but you get the easy constructor semantics:

vec3(x, y, z)

So, things are humming along.  Having spent a day in the bowels of COM interop, I feel even more strongly that such interfaces need to be left far behind.  There is too much time and energy spent on trying to massage the right things to happen.  In comparison, doing interop to straight C interfaces is almost mind numbingly simple.  Of course, it was great for the time, and solved a lot of problems, but I wonder what would happen if interfaces today were defined in something like Lua, instead of IDL and COM, etc.

LuaJIT COM Support

I am primarily a Windows based developer, so I like to take advantage of the many facilities that are available on the platform.  In the early days of Windows, most features were provided as straight ‘C’ libraries.  Sometimes they were Pascal calling convention though, and that makes for fun and games.

Then, along came COM, and the world went off kilter for a while (my humble opinion).  COM was great for getting component parts into Word and Excel.  As a general programming paradigm, I never cared for it that much.  It tries to do too much, and while doing it, gets in my way more often than it helps me.  Now, most of the time, if you’re programming in Windows, you have the full capabilities of the Visual Studio environment to help you with your COM object interactions.  Nowadays, dealing with COM in that world isn’t much harder than dealing with any other object technologies.

But, here I am in LuaJIT, and suddenly all that support evaporates.  Now, of course there is the LuaCOM project, but why won’t I be using that?  Because it was written with ordinary Lua 5.x in mind, so it does a lot of stuff that I don’t need to do.  Furthermore, like many Lua interop libraries, it seems to have met the needs of the original authors, so it doesn’t seem to be much maintained these days.  Well, that’s a good enough excuse for me to write a new interop layer using LuaJIT!

I tell you, COM is a big hairy beast with sprawling tendrils that run all up and down the Windows OS, across multiple sub-systems, sucking in library upon library, paradigm upon paradigm.  I started innocently enough with the GUID/UUID object:

typedef struct {
    unsigned long  Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char  Data4[8];
} GUID;

That’s a good start as everything in the COM world is related to a GUID of one form or another.  So, defining the basic data structure is easy enough.  Once this data structure is defined, then creating a GUID can’t be that hard:

__index = {
    Define = function(self, name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7, b8)
        return GUID({l, w1, w2, {b1, b2, b3, b4, b5, b6, b7, b8}}), name
    end,

    DefineOle = function(self, name, l, w1, w2)
        return GUID({l, w1, w2, {0xC0, 0, 0, 0, 0, 0, 0, 0x46}}), name
    end,
}

So, for creating a typical Ole GUID, you can simply do:

GUID():DefineOle(name, l, w1, w2)
GUID():Define(name, l, w1, w2, b1, b2, b3, b4, b5, b6, b7,b8)

Like in this case:

IID_INuiSensor = GUID():Define("IID_INuiSensor", 0x1f5e088c, 0xa8c7, 0x41d3, 0x99, 0x57, 0x20, 0x96, 0x77, 0xa1, 0x3e, 0x85)

And, of course, the Windows system itself has some convenience functions for creating new GUIDs, and constructing them from strings, so the following can be done as well:

IEnumMoniker = UUIDFromString("00000102-0000-0000-C000-000000000046")

Great.  That gets you the IID_ for some specific GUID.  That’s half the battle in COM.  The second half of the battle, or rather, the 2nd part of about 5 parts, is being able to get at your “interfaces”.  In COM, an “interface” is essentially defined as a structure, within which you’ve laid out the method calling table.  So, for example, here’s the definition of the “IUnknown” interface, which is a fairly common base for all other interfaces:

typedef HRESULT (*QueryInterfacePROC)(void * This, REFIID riid, void **ppvObject);
typedef ULONG (*AddRefPROC )(void * This);
typedef ULONG (*ReleasePROC)(void * This);

typedef struct _IUnknownVtbl {
QueryInterfacePROC QueryInterface;
AddRefPROC   AddRef;
ReleasePROC   Release;
} IUnknownVtbl;

typedef struct _IUnknown {
IUnknownVtbl *lpVtbl;
} IUnknown, *LPUNKNOWN;

Basically, first define your function prototypes.  Then define a structure in which this virtual table is the first item.  If you can load an object using the COM functions, then you can retrieve a specific interface, and make function calls.  Easy right?  So, almost half way there.  There are some issues related to the lifetime of objects, but LuaJIT gives you fairly good control of the lifetime of objects.  I know when they’re about to be garbage collected, so I can call a specific function to do whatever cleanup is necessary before the object moves on to the dearly departed.
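Given those definitions, making a call from LuaJIT is just a matter of indirecting through the vtable, and ffi.gc can hook Release into garbage collection.  A sketch (it assumes the IUnknown cdefs above are already loaded, and that rawptr came from some COM creation call):

```lua
local ffi = require "ffi"

-- Wrap a raw COM interface pointer so that Release() is called
-- automatically when the cdata is garbage collected.
local function wrapInterface(rawptr)
    local obj = ffi.cast("IUnknown *", rawptr)
    return ffi.gc(obj, function(self)
        self.lpVtbl.Release(self)
    end)
end

-- A method call goes through the vtable explicitly, with the object
-- itself passed as the implicit 'This' parameter:
--    local refcount = obj.lpVtbl.AddRef(obj)
```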

Why bother with all this nonsense though?  Because.  First, I wanted to get simple access to my webcam, so that I could stream audio and video, and get at the hardware compression routines.  Then, I want to get at my joystick.  Although I can access it through the ancient WinMM interfaces, I’d rather use the more modern stuff.  And lastly, there’s the Kinect.  It’s got the simplest interface for getting at the video frames and audio, but you have to go through some COM calls to get there.

And so, it’s quite a ball of yarn to be pulling on, but I am managing to walk a path that gets me some basic COM interop sooner rather than later.  Once that basic interop is in place, I think getting at more of the Windows environment will just be that much easier.

Essential Performance Tuning

At some point in the software development cycle, I turn my attention to performance tuning.  I’d like to think that I design for performance from the start, and I really do try to, but, then there’s the nitty gritty details you have to work out in implementation to ensure you actually achieve the performance you had envisioned in your dreams.

Here is a picture of a fill rate test of the renderer.  Basically, I’m trying to answer the question, “What’s the fastest I can render a solid triangle?”.  If you’ve ever done any graphics, particularly 3D, you’ll recognize that having a very fast triangle render routine is kind of essential to getting highly performant graphics.  In this particular case, I created a framebuffer that was 1920×1080, then I run at a frame rate of anywhere from 15 to 30, and I just keep upping the number of triangles drawn per frame until I can’t hold the frame rate.  Each triangle is randomly between 50 and 100 on each side, so although small, they’re fairly typical of the types that are found while doing some high polygon count 3D rendering.  The amount of time is dominated by triangle setup.  I create a DDA (line walker), do vertex sorting, then walk the DDAs, doing memory fills of a solid color along the way.  And of course, I’m not cleaning much up, so I’m generating a ton of garbage along the way, which must all be cleaned up as well.

With all this, my CPU is sitting at 25-27%, so I’m nowhere near maxing out my CPU, although, when I look at a single core of my quad proc, it can be sitting at 75-80%.  Is there opportunity for parallelism?  I could certainly imagine staging things such that the triangle scanning is separate from the polygon filling.  I could create a graphics pipeline similar to what OpenGL does.  Breaking things into geometry, vertex, and fragment shaders.  Even more interesting, I could create such a pipeline by creating separate instances of Lua State to run on separate threads, one for each processor core.  Then, send messages between them using shared memory or something.  Not quite the same as the massive number of rendering cores in a GPU, but I could still get some benefit.

Then there’s lower level stuff.  Back to data types, I’ve reconfigured what my data types look like.  Here I had some fundamentals to decide.  The thing I have to decide has to do with coding style more than anything else.  When I look at the vector and matrix classes, how do I want to program them.  Do I want to favor creation of the instances, or do I want to favor ease of performing simple arithmetic, or do I want to favor interop, or can I favor a couple at the same time?

I’m opting for favoring performance, both with and without interop.  So, let’s look at the simplest case, the lowly vec3.  In the OpenGL GLSL language, a vec3 is: float[3].  That is, a simple array with 3 float values in it.  In GLSL, there is some syntactic sugar that allows you to do simple arithmetic with vectors as if they were basic types, so you can do:  v3 = v1 + v2

OK, that’s very convenient.  Is it dramatically different than v3 = add(v1,v2)?

In LuaJIT, I can have the convenience of that syntactic sugar, but it requires me to create a struct to represent the vec3, which means that it’s not quite as easy to pass the thing around when it comes time to do interop.  There will always be some sort of pointer reference involved.  Not a bad tradeoff I guess.
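For the curious, the struct route looks something like the following sketch, using ffi.metatype to attach the arithmetic sugar; the field names and the `vec3_t` type name here are just illustrative choices, not anything from the renderer itself:

```lua
local ffi = require("ffi")

ffi.cdef[[
typedef struct { float x, y, z; } vec3_t;
]]

-- Attach metamethods so cdata instances support '+' directly
local vec3_t
vec3_t = ffi.metatype("vec3_t", {
    __add = function(a, b)
        return vec3_t(a.x + b.x, a.y + b.y, a.z + b.z)
    end,
    __tostring = function(v)
        return string.format("(%g, %g, %g)", v.x, v.y, v.z)
    end,
})

local v1 = vec3_t(1, 2, 3)
local v2 = vec3_t(10, 20, 30)
local v3 = v1 + v2
print(v3)  -- (11, 22, 33)
```

The price of the sugar is that `v3` is cdata backed by a struct, so interop calls see a pointer to the struct rather than a bare array, which is the tradeoff mentioned above.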

But, here’s another way of doing things.

local ffi = require("ffi")

float = ffi.typeof("float")
floatv = ffi.typeof("float[?]")

With these two simple definitions, I can do the following:

local myfloat = float(3.7)
local myfloatarray = floatv(320*240*4, 3.7)

In the first case, I am allocating a single float value.  I can then go and use that float value anywhere as if it were a number.  In some cases, I’ll have to actually convert it to a number to pass to a function that expects a lua number, and can’t deal with the fact that this is actually cdata.  But, as long as I stay within the confines of my own little world here, it’s fine.

In the second case, I’m allocating an array of floats.  Here, it could represent an array of rgba values (4 floats) in a framebuffer that is 320×240.  Well, that’s pretty precise control of things.
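To make the framebuffer idea concrete, here is a small sketch of addressing individual pixels in that flat float array; the `setpixel` helper and the row-major rgba layout are my own illustrative assumptions about how such a buffer might be indexed:

```lua
local ffi = require("ffi")
local floatv = ffi.typeof("float[?]")

local width, height = 320, 240
-- 4 floats (r, g, b, a) per pixel, row-major
local fb = floatv(width * height * 4)

local function setpixel(fb, x, y, r, g, b, a)
    local offset = (y * width + x) * 4
    fb[offset]     = r
    fb[offset + 1] = g
    fb[offset + 2] = b
    fb[offset + 3] = a
end

setpixel(fb, 10, 20, 1, 0, 0, 1)   -- an opaque red pixel at (10, 20)
print(fb[(20 * width + 10) * 4])   -- 1
```

No tables, no boxing; just raw floats laid out contiguously, which is exactly what a blit or an interop call wants.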

Here’s another case where I’m favoring syntax, as well as interop:

vec3 = function(x,y,z) return floatv(3, x,y,z) end
local vertex1 = vec3(10, 20, 30)

That is way convenient!  I had this funny realization that C macros are nothing more than functions in Lua.  Duh!  Even better, they’re not just macros, they’re actually functions.  In this particular case, I want to essentially mimic the vec3 type found in OpenGL GLSL.  So, I’ve already created that convenient floatv type, which will readily create arrays of a particular size for me.  So, I just use that, conveniently creating an array with three elements, and initialize it with whatever the user had passed in.  And there you have it.

Now, when it comes time to do some interop to an OpenGL function, I can simply pass myfloatarray, or vertex1, to a function that expects a pointer to a float array, and that’s the end of that.  At a very low level, passing a single parameter to a function is faster than passing 3 parameters to a function, so passing a pointer to an array with 3 floats is faster than calling a function that takes 3 separate floats, as long as the array is relatively long lived.  Not a lot of marshalling going on, not any conversion from Lua number to float.  The thing is garbage collected in time, and that’s very convenient, and saves me from having to do the whole malloc/free thing that I would normally have to do in ‘C’.  This is why I say Lua might be the new ‘C’, at least for me.  I get many of the benefits of low level stuff in C (minus easy bit twiddling) without all the hassles and pitfalls of bad memory management.
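A quick sketch of what that pass-by-pointer convenience looks like in practice; the `gl.glVertex3fv` call is a hypothetical binding, shown only as a comment, while ffi.copy demonstrates the same pointer-compatible behavior with a real FFI call:

```lua
local ffi = require("ffi")
local floatv = ffi.typeof("float[?]")
local vec3 = function(x, y, z) return floatv(3, x, y, z) end

local v = vec3(10, 20, 30)

-- A float[3] cdata passes directly where a C API expects a float*:
-- gl.glVertex3fv(v)   -- hypothetical OpenGL binding

-- The same pointer compatibility, demonstrated with ffi.copy:
local dst = floatv(3)
ffi.copy(dst, v, ffi.sizeof("float") * 3)
print(dst[0], dst[1], dst[2])  -- 10  20  30
```

No marshalling layer in sight: the cdata array decays to a pointer at the call boundary, just as an array would in C.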

So, that’s a bit of performance tuning.  Choosing the right data structures to meet the needs of the situation.  I am making the design choices at the fundamental structural representation layer.  My next area of focus is networking code.  I need to get maximum connections per processor, and maximum throughput.  We’ll see how it goes.

I wants me my data types!!

One of the challenges of programming in Lua is having precise control of data types.  Or, rather, having the typical set of data types built in.  The built in data types for Lua are quite sparse.  For example, there is ‘number’, which is essentially a double.  There is no separation for byte, short, int, float, just ‘number’.  There is ‘string’, and that’s a catch all for bytes in general as it is 8-bit clean, as they say.  But, what if I really want the semantics of an unsigned char, or short?

Well, that’s where LuaJIT can really be useful.  First, you can do the following to declare some basic data types:

ffi.cdef[[
typedef unsigned char Byte;
typedef short         Int16;
typedef int           Int32;
typedef float         Single;
typedef double        Double;
]]

Byte = ffi.typeof("Byte")
Int16 = ffi.typeof("Int16")
Int32 = ffi.typeof("Int32")
Single = ffi.typeof("Single")
Double = ffi.typeof("Double")

Then you can do the following to create instances of those types:

local val1 = Int16(30)
local val2 = Byte(15)
local val3 = Double(val1 + val2)

I think this is useful.  You can go ahead and use the variables with the expected semantics of each type.  It would be nice if you could go further and actually use bitwise operators, but, since that’s not really baked into Lua, you can’t really do it.  The bitwise operators (from LuaJIT’s ‘bit’ module) are defined against the ‘number’ type, so you have to convert back with tonumber() in order to use them:

local bit = require("bit")
local bor, band = bit.bor, bit.band

local bits1 = Byte(0x01)
local bits2 = Byte(0x10)
local bitsor = bor(tonumber(bits1), tonumber(bits2))
local bitsand = band(tonumber(bits1), tonumber(bits2))
local bitsplus = bits1+bits2

But, if you’re not dealing with bitwise operators, and you’re just passing things around, consuming them, doing calculations and the like, you can actually get away with doing things as I’ve done here.  So, it depends on how you’re programming.  If you find yourself doing a lot of interop programming, you might find it useful to create these native data type declarations.  That way, when it comes time to marshal, you won’t always be converting from number, and your interop will be faster.  If you are spending most of your time within Lua, not doing interop, then it’s probably better just to stick with the native ‘number’ type instead.

And so it goes.  If you want a more traditional feel to your programming, LuaJIT makes it fairly easy.  Otherwise, you can just stick with the standard way of doing things.  In some sense, LuaJIT seems like the new “C” to me.  It allows you to construct any manner of higher level abstractions, while at the same time allowing you to get down pretty low to the machine representation of things.

There is another gem in the LuaJIT firmament, and that is the DynASM assembler.  Not something you can use from LuaJIT directly, but with it you can fairly easily incorporate C and assembly routines together, and that’s got to be useful as well.  But, another time.
