Unchaining the GPU with Lua and OpenCL

Quite a few years ago, I programmed the BeBox to display multiple streams of .mpg video, while simultaneously pulling in video feeds from Satellite and cable. In all, you could see snapshots of roughly six things on the screen, happening all at the same time.

The CPUs were utilized primarily for the mpeg part, doing decoding, and some special effects when changing sources being displayed in the primary area. The feeds coming off the Happauge video capture card were being DMAd directly into the framebuffer of the graphics card, so there wasn’t any work by the CPU going on there.

That was a pretty good result for a dual-proc machine circa 1996. That was at the very beginning of the birth of nVidia, and GPUs were actually first becoming mainstream from 3dfx. Roll forward 16 years… and where are we today?

Well, the machine whining away under my desk is a 3.4Ghz AMD Phenom(tm) II X4 965 Processor, with 8Gb of RAM. The graphics card is an nVidia gfx 275. This machine is a couple years old now, but compared to that BBox, it’s a monster from another planet. As such, you would think it would be able to perform the same feats as that old machine, without even heating up a single resistor. To make it even more of a monster, there’s that GPU sitting in there which has 1000 times over the amount of processing power utilized to send people to the moon in the sixties.

So, what can this machine do? Well, It allows me to type really fast!! I can read emails in the blink of an eye, and Netflix movies play with nary a stutter!  I tell you, it’s simply amazing!  But, what about all that horsepower that’s sitting idle under my desk?  Surely I can put it to some good usage.

Well, of course graphics processing can largely be offloaded to the GPU these days.  Although I conjured up a graphics library that lives complely on the CPU, and just draws to memory, doing the same using the GPU is far faster, and takes a lot less electricity.

And finally, I come to the point.  I have gotten far enough along with my OpenCL binding that I can now actually do some OpenCL programming.  OpenCL is an interesting little thing.  Basically, it introduces the concept of ‘kernel’ programming.  And here, Kernel does not mean the OS kernel, but rather the small little bit of code that will run in parallel on the same piece of memory that other little bits of code are running against.  This is in fact what happens when you’re running a GLSL shader.  It’s just a little ‘kernel’, and in the case of a fragment shader, that little kernel runs against all the pixels in a frame, in parallel with hundreds of others doing the same thing.

Using GLSL based fragment shaders is great for graphics programming, but for general computing, it’s kind of clunky as you’d have to cast your compute problem into terms that the graphics pipeline can understand.  Furthermore, in order to use GLSL at all, you have to do things like create a GLContext, which requires a DeviceContext, which requires a Window, or at least a GDIBitmap.  That’s a lot of machiner to just write a bit of code to manipulate some data.

OpenCL changes things a bit.  First of all, you have access to the GPU power without the graphics constructs.  You still have to create a proper context, but it’s far easier without having to worry about windows and bitmaps.  There are some concepts, and a hierarchy for doing things.  You start at the top with platforms.  There may be multiple “platforms” within your machine.  Usually there is only one though.  Within a platform, there are devices.  There may be multiple devices in a platform.  For example, you might have two nVidia cards in your machine, and that will list as two devices.

After the device, there is the concept of a context.  The context can span multiple devices.  The context controls things like where memory is created, where programs are created, where kernels are run, and the like.  This is really where things start to get interesting.

From the context, you can create a “program”.  Here, I think it is easier to think of the program as “image”.  You are essentially placing an “image” onto the context.  I think of the image as the raw OS image, ready to have little bits of code running in it.

Then, finally, you can create a “kernel”, which is actually a piece of code that’s going to execute on the device.

That’s a lot of stuff, and a lot of error checking, and a lot of pointers that can go wrong, etc.  So, the Lua version looks like this:

local platform, num = CLGetPlatform()
local devices = platform:GetDevices(CL_DEVICE_TYPE_GPU)

That is, get the first plaform available. Then, get the list of devices available on the platform. And finally, run a kernel (code below).

Using Lua is nice because garbage collection can be used to release various resources when they’re no longer in use. That saves a bit of typing, and you don’t have to remember anything.

To run a kernel, I looked at a simple example in C, written by Clifford Wolf.

local program_source = [[
    __kernel void simple_demo(__global int *src, __global int *dst, int factor)
        int i = get_global_id(0);
        dst[i] = src[i] * factor;

function runkernel(device)
    local context = CLContext():CreateForDevice(device);

    local program = context:CreateProgramFromSource(program_source);

    local NUM_DATA = 100;
    local buffsize = ffi.sizeof("int")*NUM_DATA;

    local input_buffer = context:CreateBuffer(buffsize, CL_MEM_READ_ONLY);
    local output_buffer = context:CreateBuffer(buffsize, CL_MEM_WRITE_ONLY);

    local factor = 2;
    local lpfactor = ffi.new("int[1]", factor);

    local kernel = program:CreateKernel("simple_demo");

    kernel:SetIndexedArg(0, input_buffer.Handle, ffi.sizeof("cl_mem"));
    kernel:SetIndexedArg(1, output_buffer.Handle, ffi.sizeof("cl_mem"));
    kernel:SetIndexedArg(2, lpfactor, ffi.sizeof("int"));

    local queue = context:CreateCommandQueue(input_buffer);

    local intsize = ffi.sizeof("int");
    local lpi = ffi.new("int[1]");
    for i=0, NUM_DATA-1 do
        local offset = intsize*i;
        lpi[0] = i;
        queue:EnqueueWriteBuffer(input_buffer, offset, lpi, intsize);

    local global_work_size = ffi.new("size_t[1]",NUM_DATA);
    local kernel_completion = queue:EnqueueNDRangeKernel(kernel, global_work_size);


    local lpdata = ffi.new("int[1]");
    for i=0, NUM_DATA-1 do
        local offset = i*intsize;
        local err = ocl.clEnqueueReadBuffer(queue.Handle, output_buffer.Handle, 
            CL_TRUE, offset, intsize, lpdata, 0, nil, nil);
        CL_CHECK(err, "clEnqueueReadBuffer");

In the first part of runkernel(), I’m using the nice object like interface that the Lua binding provides. In the last part of the function, I’m using the straight OpenCL calls, just to show how that’s done.

There are a couple of things of note here. First, the ‘program_source’ is just a string. This is the same as with GLSLProgram. There are various environments available, including from nVidia, which will help you create these kernel strings. Once you have your string perfected, you can just drop it in for inclusion as your kernel.

Since a kernel is not a function in lua that you can just pass variables to, you have to do some explicit work to pass values in as arguments. kernel:SetIndexedArg() performs this task. This is an ideal candidate for some Lua magic to make it simpler. Unlike the GLSL interface, I can’t query the program to find out the types of the various arguments. But, since I wrote the kernel, I do know their types, so, I write a little table that maps the index to a name, and the data values, and this code could turn into a more familiar:

kernel.src = input_buffer
kernel.dst = output_buffer
kernel.factor = 2

Then I’d be happy as a clam. There is another concept that gets in your face here. That’s the whole queuewrite, queueread business. Basically, all data and kernel deployment happens as commands executed from a queue. That fact does not need to be front and center, and a little bit of wrapping might make it nicer to deal with.

Now that this is in hand, what can be done with it? Well, there’s the obvious graphics stuff, which is where it came from, but there’s a whole lot more. I was just thinking that this might be a great way to perform base64 encoding for example. It’s a largely parallel task. You could write a kernel that turns a 3-character block into the equivalent 4-character code. As this kernel can run in parallel, you could literally have hundreds of them working on encoding your text at the same time. At the end, you’ve got a base64 encoded thing, in a fraction of the time it would normally take.

Using a slightly different approach, that of stream processing, you could probably perform some cryptographic operations, like digest calculations and the like.

There is one tool that I found that makes exploring OpenCL fairly easy and fun. OpenCL Studio is done by Geist Software Labs, who appear to be a consultancy for high performance computing. They have a nice Lua scriptable environment that allows you to play with OpenCL and OpenGL, just like that.

Having such a tool available is an accelerant for me to get even more productivity wrung out of myself, and my machine.

With my little Lua Binding to OpenCL, I am confident that I’m going to be able to get more per killowatt out of my programming.  That’s good for my programs, and good for the environment.  I’m hoping that between a fast quad-proc, super duper graphics card, and Lua, I’ll finally be able to write and utilize programs that are more impressive that what I could do 15 years ago.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s