LuaJIT to Khronos

I’ve been sitting on some code for quite some time.  Hoarding it, as it were, for my own devices.  As my current desktop machine seems to be failing more frequently, I thought it would be a good time to do some spring cleaning and put up some more code.

The current round has to do with the Khronos Group family of APIs.  The Khronos Group is one of those industry bodies set up for collaboration across multiple companies.  Probably the most famous of the APIs they’ve dealt with to date is the OpenGL API.  The follow-on to that was the OpenGL ES API.  Then along came one of their own, OpenVG, which deals with vector graphics.  Rounding out the set, you have OpenCL for distributed computing, and OpenMAX for audio/video work.

Since the group was originally founded by companies interested in GPUs, those are the APIs that are the most mature.  Recently, playing with the Raspberry Pi, I found that OpenGL ES/EGL, OpenVG, and OpenMAX are the only way to get at hardware acceleration on that device.  There are already a couple of examples of OpenVG running on the Pi.  One example of using OpenVG on the Pi was done by Anthony Starks.  If for no other reason, you’ve got to check out the code because of the author’s name…

Anthony Starks has several examples of how to use OpenVG, and how to bind and use the Go language to do some really nice stuff.  Well, there’s no reason for those Go programmers to have all the fun, so I decided to make a really convenient binding to LuaJIT for OpenVG.  But why stop there?  Why not all the APIs?  Well, I already had the OpenCL and OpenGL bindings laying about, so I’ve put them all together into a single repository, the LJIT2Khronos project.

One of the things I realized way back was that these bindings can only be useful if they can be demonstrated.  The first thing you need to do to use any of these APIs is establish a connection to the windowing system.  In this case, I had to include some bindings to Windows APIs as well.  There is a Win32 directory, which contains some basic bindings for User32, GDI, Kernel32, Windows types, and the like.  That’s more than enough to get a basic window up on the screen, and certainly enough to get the window handle and device context that the various APIs require.

What do you get for your troubles?

Let’s say you want to create a window with a ‘frame rate’ of 3 frames per second, where a routine you specify will be called on each tick to do the rendering…

local NativeWindow = require "User32Window"
local EGL = require "egl_utils"

local OpenVG = require "OpenVG"
local OpenVGUtils = require "OpenVG_Utils"
local ogm = require "OglMan"
local RenderClass = require "Drawing"

-- Setup the "display" object
local dpy = EglDisplay.new(nil, EGL.EGL_OPENVG_API);

local screenWidth = 640;
local screenHeight = 480;

-- Create the renderer class which
-- will handle drawing tasks
local Renderer = RenderClass.new(dpy, screenWidth, screenHeight);

local tick = function(ticker, tickCount)
	print("Tick: ", tickCount);

	Renderer:Background(0, 0, 0);	  -- Black background
end

-- Create a window
local winParams = {
	ClassName = "EGLWindow",
	Title = "EGL Window",
	Origin = {10,10},
	Extent = {screenWidth, screenHeight},
	FrameRate = 3,

	OnTickDelegate = tick;
};

-- create an EGL window surface
local win = NativeWindow.new(winParams)
assert(win, "Window not created");

local surf = dpy:CreateWindowSurface(win:GetHandle())

-- Make the context current
dpy:MakeCurrent(surf);

local ratio = screenWidth/screenHeight;
glFrustum(-ratio, ratio, -1, 1, 1, 10);

-- Now, finally do some drawing
win:Run();

-- free up the display
dpy:free();

I’ll be the first to admit, this is still quite a lot to type to do some very basic rendering, but it’s way less typing than you’d have to do on your own. I’ll write up a separate article that goes into more depth on how to use this OpenVG stuff; for now, suffice to say, it will work in whatever environment has the OpenVG and EGL libraries available (at least Windows, the Raspberry Pi, and Linux in general).

But of course, there’s more. This repository also includes the OpenGL bindings, and the OpenCL stuff as well. Those bindings are fairly mature; at least, I’ve written a couple of apps using them.

So, there you have it. Some fairly complete bindings to these various Khronos Group APIs. Getting them off my machine, and into the interwebs gives me some relief. As well, I expect to make full use of them across the multiple environments in which they are available.

Unchaining the GPU with Lua and OpenCL

Quite a few years ago, I programmed the BeBox to display multiple streams of .mpg video, while simultaneously pulling in video feeds from Satellite and cable. In all, you could see snapshots of roughly six things on the screen, happening all at the same time.

The CPUs were utilized primarily for the MPEG part, doing decoding and some special effects when changing the source being displayed in the primary area. The feeds coming off the Hauppauge video capture card were being DMA’d directly into the framebuffer of the graphics card, so there wasn’t any CPU work going on there.

That was a pretty good result for a dual-proc machine circa 1996. That was at the very beginning of nVidia’s birth, when GPUs were first becoming mainstream from 3dfx. Roll forward 16 years… and where are we today?

Well, the machine whining away under my desk is a 3.4GHz AMD Phenom(tm) II X4 965 Processor, with 8GB of RAM. The graphics card is an nVidia GTX 275. This machine is a couple years old now, but compared to that BeBox, it’s a monster from another planet. As such, you would think it would be able to perform the same feats as that old machine without even heating up a single resistor. To make it even more of a monster, there’s that GPU sitting in there, which has over 1000 times the processing power used to send people to the moon in the sixties.

So, what can this machine do? Well, it allows me to type really fast!! I can read emails in the blink of an eye, and Netflix movies play with nary a stutter!  I tell you, it’s simply amazing!  But what about all that horsepower that’s sitting idle under my desk?  Surely I can put it to some good use.

Well, of course graphics processing can largely be offloaded to the GPU these days.  Although I conjured up a graphics library that lives completely on the CPU, and just draws to memory, doing the same on the GPU is far faster, and takes a lot less electricity.

And finally, I come to the point.  I have gotten far enough along with my OpenCL binding that I can now actually do some OpenCL programming.  OpenCL is an interesting little thing.  Basically, it introduces the concept of ‘kernel’ programming.  Here, kernel does not mean the OS kernel, but rather a small bit of code that runs in parallel against the same piece of memory that other small bits of code are running against.  This is in fact what happens when you’re running a GLSL shader.  It’s just a little ‘kernel’, and in the case of a fragment shader, that little kernel runs against all the pixels in a frame, in parallel with hundreds of others doing the same thing.
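To make that ‘kernel’ idea concrete, here is a tiny sketch, in Python purely for illustration (a real kernel would be written in OpenCL C or GLSL): one small function, applied independently at every index, which the hardware would run in parallel.

```python
# A sketch of the 'kernel' idea: one tiny function, applied
# independently at every index.  On a GPU, all of these
# invocations would run in parallel.

def scale_kernel(src, dst, factor, global_id):
    # The body a single work-item executes (global_id plays the
    # role of get_global_id(0)).
    dst[global_id] = src[global_id] * factor

def run_parallel(kernel, src, factor):
    # A sequential stand-in for the GPU's parallel dispatch.
    dst = [0] * len(src)
    for gid in range(len(src)):   # conceptually: all at once
        kernel(src, dst, factor, gid)
    return dst

print(run_parallel(scale_kernel, [0, 1, 2, 3], 2))  # prints [0, 2, 4, 6]
```

The point is that the kernel body itself has no loop; the “loop” is the dispatch, which is exactly what the GPU parallelizes.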

Using GLSL based fragment shaders is great for graphics programming, but for general computing, it’s kind of clunky, as you’d have to cast your compute problem into terms that the graphics pipeline can understand.  Furthermore, in order to use GLSL at all, you have to do things like create a GLContext, which requires a DeviceContext, which requires a Window, or at least a GDIBitmap.  That’s a lot of machinery just to write a bit of code to manipulate some data.

OpenCL changes things a bit.  First of all, you have access to the GPU’s power without the graphics constructs.  You still have to create a proper context, but it’s far easier without having to worry about windows and bitmaps.  There are some concepts, and a hierarchy for doing things.  You start at the top with platforms.  There may be multiple “platforms” within your machine, though usually there is only one.  Within a platform, there are devices, and there may be multiple devices in a platform.  For example, you might have two nVidia cards in your machine, and they will list as two devices.

After the device, there is the concept of a context.  The context can span multiple devices.  The context controls things like where memory is created, where programs are created, where kernels are run, and the like.  This is really where things start to get interesting.

From the context, you can create a “program”.  Here, I think it is easier to think of the program as “image”.  You are essentially placing an “image” onto the context.  I think of the image as the raw OS image, ready to have little bits of code running in it.

Then, finally, you can create a “kernel”, which is actually a piece of code that’s going to execute on the device.

That’s a lot of stuff, and a lot of error checking, and a lot of pointers that can go wrong, etc.  The Lua version looks like this:

local platform, num = CLGetPlatform()
local devices = platform:GetDevices(CL_DEVICE_TYPE_GPU)

That is, get the first platform available. Then, get the list of devices available on the platform. And finally, run a kernel (code below).

Using Lua is nice because garbage collection can be used to release various resources when they’re no longer in use. That saves a bit of typing, and you don’t have to remember anything.

To run a kernel, I looked at a simple example in C, written by Clifford Wolf.

local ffi = require "ffi"

local program_source = [[
    __kernel void simple_demo(__global int *src, __global int *dst, int factor)
    {
        int i = get_global_id(0);
        dst[i] = src[i] * factor;
    }
]]

function runkernel(device)
    local context = CLContext():CreateForDevice(device);

    local program = context:CreateProgramFromSource(program_source);

    local NUM_DATA = 100;
    local buffsize = ffi.sizeof("int")*NUM_DATA;

    local input_buffer = context:CreateBuffer(buffsize, CL_MEM_READ_ONLY);
    local output_buffer = context:CreateBuffer(buffsize, CL_MEM_WRITE_ONLY);

    local factor = 2;
    local lpfactor = ffi.new("int[1]", factor);

    local kernel = program:CreateKernel("simple_demo");

    kernel:SetIndexedArg(0, input_buffer.Handle, ffi.sizeof("cl_mem"));
    kernel:SetIndexedArg(1, output_buffer.Handle, ffi.sizeof("cl_mem"));
    kernel:SetIndexedArg(2, lpfactor, ffi.sizeof("int"));

    local queue = context:CreateCommandQueue(device);

    local intsize = ffi.sizeof("int");
    local lpi = ffi.new("int[1]");
    for i=0, NUM_DATA-1 do
        local offset = intsize*i;
        lpi[0] = i;
        queue:EnqueueWriteBuffer(input_buffer, offset, lpi, intsize);
    end

    local global_work_size = ffi.new("size_t[1]", NUM_DATA);
    local kernel_completion = queue:EnqueueNDRangeKernel(kernel, global_work_size);

    -- Read the results back, using the raw API this time
    local lpdata = ffi.new("int[1]");
    for i=0, NUM_DATA-1 do
        local offset = i*intsize;
        local err = ocl.clEnqueueReadBuffer(queue.Handle, output_buffer.Handle,
            CL_TRUE, offset, intsize, lpdata, 0, nil, nil);
        CL_CHECK(err, "clEnqueueReadBuffer");
        print(string.format("%d * %d = %d", i, factor, lpdata[0]));
    end
end
In the first part of runkernel(), I’m using the nice object-like interface that the Lua binding provides. In the last part of the function, I’m using the straight OpenCL calls, just to show how that’s done.

There are a couple of things of note here. First, the ‘program_source’ is just a string, the same as with a GLSLProgram. There are various environments available, including ones from nVidia, which will help you create these kernel strings. Once you have your string perfected, you can just drop it in for inclusion as your kernel.

Since a kernel is not a Lua function that you can just pass variables to, you have to do some explicit work to pass values in as arguments; kernel:SetIndexedArg() performs this task. This is an ideal candidate for some Lua magic to make it simpler. Unlike the GLSL interface, I can’t query the program to find out the types of the various arguments. But since I wrote the kernel, I do know their types, so I could write a little table that maps each index to a name and the data values, and this code could turn into the more familiar:

kernel.src = input_buffer
kernel.dst = output_buffer
kernel.factor = 2

Then I’d be happy as a clam. There is another concept that gets in your face here: the whole queue-write, queue-read business. Basically, all data movement and kernel deployment happens as commands executed from a queue. That fact does not need to be front and center, and a little bit of wrapping might make it nicer to deal with.
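The name-to-index mapping is easy to sketch. Here is a toy version, in Python just to show the shape of the idea (the class and method names here are invented for illustration; the real thing would be Lua metatable magic over SetIndexedArg):

```python
class NamedKernelArgs:
    """Toy wrapper: assign kernel arguments by name instead of index.

    arg_map is written out by hand, since (unlike GLSL) OpenCL gives
    us no way to query a compiled kernel for its argument names.
    """
    def __init__(self, kernel, arg_map):
        # Bypass our own __setattr__ while storing the internals.
        object.__setattr__(self, "_kernel", kernel)
        object.__setattr__(self, "_arg_map", arg_map)

    def __setattr__(self, name, value):
        # Translate the friendly name back into a positional index,
        # then make the ordinary indexed-argument call.
        self._kernel.set_indexed_arg(self._arg_map[name], value)

# A stand-in kernel that just records the calls made against it.
class FakeKernel:
    def __init__(self):
        self.args = {}
    def set_indexed_arg(self, index, value):
        self.args[index] = value

k = FakeKernel()
args = NamedKernelArgs(k, {"src": 0, "dst": 1, "factor": 2})
args.factor = 2          # same effect as a SetIndexedArg(2, ...) call
print(k.args)            # prints {2: 2}
```

In Lua, a `__newindex` metamethod plays exactly the role `__setattr__` plays here.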

Now that this is in hand, what can be done with it? Well, there’s the obvious graphics stuff, which is where it came from, but there’s a whole lot more. I was just thinking that this might be a great way to perform base64 encoding for example. It’s a largely parallel task. You could write a kernel that turns a 3-character block into the equivalent 4-character code. As this kernel can run in parallel, you could literally have hundreds of them working on encoding your text at the same time. At the end, you’ve got a base64 encoded thing, in a fraction of the time it would normally take.

Using a slightly different approach, that of stream processing, you could probably perform some cryptographic operations, like digest calculations and the like.

There is one tool that I found that makes exploring OpenCL fairly easy and fun. OpenCL Studio is done by Geist Software Labs, who appear to be a consultancy for high performance computing. They have a nice Lua scriptable environment that allows you to play with OpenCL and OpenGL, just like that.

Having such a tool available is an accelerant for me; it helps wring even more productivity out of myself, and my machine.

With my little Lua binding to OpenCL, I am confident that I’m going to be able to get more per kilowatt out of my programming.  That’s good for my programs, and good for the environment.  I’m hoping that between a fast quad-proc, a super duper graphics card, and Lua, I’ll finally be able to write and utilize programs that are more impressive than what I could do 15 years ago.