Kinect with Kuwahara Filtering

Sometimes I see those commercials on TV where the people look kind of cartoonish, but the video is still real life. I suspect they probably use some form of spatial filtering that smooths the image while maintaining the various boundaries. One such filter is the so-called “Kuwahara” filter. It takes a neighborhood of pixels around each pixel, splits it into overlapping regions, and outputs the average of the region with the least variation, and out pops a noise reduction that leaves the edges intact.
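To make the idea concrete, here is a minimal CPU sketch of a Kuwahara filter in plain Lua. It is not the GLSL shader from the book; it assumes a grayscale image stored as img[y][x], clamps coordinates at the borders, and the names kuwahara and regionStats are made up for illustration.

```lua
-- Kuwahara filter sketch for a grayscale image stored as img[y][x] (1-based).
-- For each pixel, look at the four (r+1)x(r+1) quadrants that have the pixel
-- as a corner, and output the mean of the quadrant with the smallest variance.

local function clamp(v, lo, hi)
    if v < lo then return lo end
    if v > hi then return hi end
    return v
end

-- Mean and variance of img over [x0..x1] x [y0..y1], clamping at the edges.
local function regionStats(img, w, h, x0, y0, x1, y1)
    local sum, sumSq, n = 0, 0, 0
    for y = y0, y1 do
        local row = img[clamp(y, 1, h)]
        for x = x0, x1 do
            local v = row[clamp(x, 1, w)]
            sum = sum + v
            sumSq = sumSq + v * v
            n = n + 1
        end
    end
    local mean = sum / n
    return mean, sumSq / n - mean * mean
end

local function kuwahara(img, r)
    local h, w = #img, #img[1]
    local out = {}
    for y = 1, h do
        out[y] = {}
        for x = 1, w do
            local quadrants = {
                { x - r, y - r, x, y },  -- upper left
                { x, y - r, x + r, y },  -- upper right
                { x - r, y, x, y + r },  -- lower left
                { x, y, x + r, y + r },  -- lower right
            }
            local bestMean, bestVar = nil, math.huge
            for _, q in ipairs(quadrants) do
                local mean, var = regionStats(img, w, h, q[1], q[2], q[3], q[4])
                if var < bestVar then
                    bestMean, bestVar = mean, var
                end
            end
            out[y][x] = bestMean
        end
    end
    return out
end
```

Because a pixel next to an edge always has at least one quadrant lying entirely on its own side, the flat side wins the variance contest and the edge stays sharp. A pixel shader does the same per-pixel work, just for every pixel at once.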

The video here shows a Kuwahara filter running as a GLSL pixel shader. I can’t take credit for the code; it comes almost verbatim out of the book GPU Pro. In this case, the video source is the Kinect, of course.

This filter easily maintained a respectable 30 frames per second.

What’s more interesting to me is the fact that it’s so totally obvious that the GPU can be utilized for this sort of thing, that I can’t imagine why I’d want to rely on the CPU alone for such tasks. Considering the fact that even the low-end Raspberry Pi contains OpenGL capabilities, that means that this type of powerful processing is available, even on what I would consider to be a throwaway computicle.

I guess there’s no excuse for coming up with fairly lame software that does not take advantage of the serious compute power that can be had almost from a box of cereal.

The Existential Program

In roughly 1990, I wrote this program called “Who’s Calling?” for the nascent NeXT computer.  The program itself was one of those early day PIM type programs, you know, contacts, calendar, etc.  The most interesting thing about it at the time was the ‘security’ mechanism I used to prevent random copying.  Basically, it had this little routine it ran when you installed it.  It would find a bit of unique information about your machine, and use that as key information to create a digest.  Once installed, every time the program launched, it would check that digest, and try to decode based on the bit of information from the currently executing machine.  If it got back what it expected, it would continue running.

It was a fairly simple process that prevented casual copying.  In the end, it was probably more trouble than it was worth, but while coming up with that little trick, I headed down this path of exploring identity.  I asked myself questions such as “How can a program know when it has been tampered with?” and “Does a program have an identity?”  At the same time, I was writing some collaboration apps, and other questions of authorization and authentication came up there as well.  If I’m collaborating with several people in a session, who has the right to take over the pen?  Does a new user requesting entrance into the collaboration have the right permissions to do so?  How do I know who they are?

This was all pre-internet boom, so it was very interesting.  Then the internet came along, and we went through a round of Berkeley style “free love, free information, free sharing…”, and then it all came crashing down.

But, now we find ourselves at the internet trough again, and those old questions of identity and authorization are coming back, but in a much more serious way.  People are exchanging banking information, and they really don’t want that shared or used by unauthorized entities.

So, I find myself contemplating identity and authorization in a big way, and I find myself going back to the very same old questions.  Does a program have an identity?  What is the smallest unit of existence?  As far as computers are concerned, I can tie this to the idea of the computicle.  Does a computicle have an identity?  The answer to me is yes, and in fact, this is part of what defines a computicle in the first place.

Looking back, I defined a computicle as having a few key traits:

  • Ability to receive input
  • Ability to perform some operations
  • Ability to communicate with other computicles

There is an implied identity in here.  You have to ask the question, “What has the Ability to…?”

You wouldn’t believe how this comes up in some of the most esoteric and mundane places.  For example, once I create multiple cooperating threads in my super computer simulator, I have to answer the question “Who’s responsible for managing chunks of memory?”

The scenario is simple.  Let’s say I have one computicle which does nothing more than generate keyboard/mouse events, and hands them off to other computicles to deal with.  Every time it creates an event, it allocates a chunk of memory, to hand off to the other threads.  Now, once that chunk of memory is created, who owns it?  The keyboard/mouse generator could retain ownership, but for how long?  Does a recipient have a guarantee as to when it might be freed or reused?  Should the recipient copy it?  How long do they have to copy it before it is stale?  Can they share this chunk of memory with others, or what?  It’s such a simple thing, but it seems so hard to deal with the combinatoric explosion.

Well, there are certainly a few ways to deal with things.  First, I can go back and look at memory allocation.  I created a couple of simple heap structures a while back, and I recently revisited and updated them based on my work with the multiple threads.  One of the questions I was trying to answer is, can the memory chunk deal with its own destiny?  Said in less philosophical terms, can I use the GC system of Lua to manage the life of a chunk of memory in a reasonable and predictable way?

So, here’s a couple of data structures to deal with Heap allocated memory (Win32):

typedef struct {
    void *    Data;
    int32_t   Size;
    HANDLE    HeapHandle;
    DWORD     owningthread;
} HeapBlob;

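typedef struct {
    HANDLE  Handle;
    int     Options;
    size_t  InitialSize;
    size_t  MaximumSize;
    DWORD   owningthread;
} Heap;  /* closing name missing in the source; "Heap" is assumed from the later Heap:Alloc() call */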

There are a couple of ways to allocate stuff in Lua. The regular way, without doing anything fancy, is to just create your thing, and start using it:

local myList = {}
myList.name = "William"
myList.job = "Programmer"

Eventually, once there are no references to your little list, it will magically disappear, being swept away by the garbage collection system.

Then, there’s allocating stuff using LuaJIT FFI:

local myChunk = ffi.new("char[256]")

Something allocated in this way will also be garbage collected, when the ‘myChunk’ variable no longer has any references to it. As long as you’re running in a single-threaded environment, and you’re not passing your variables off to any other thread to deal with, this will work just fine. But, what if you need to have your chunk of memory last longer, even if you lose all references to it from the creating side?

The third way of creating a chunk of memory is to use a direct system memory allocation call. In the case of Windows, I could do the following:

local ptr = kernel32.HeapAlloc(self.Handle, flags, nbytes)

Ignoring for the moment where that ‘kernel32’ and self.Handle came from, essentially, you’ll get back a pointer to a chunk of memory that Windows knows about. Lua doesn’t know anything about this as a chunk of memory though. I could use ffi.gc to tell the garbage collector about it, and what function to call once ptr is no longer referenced, but really, if that is what you need, then you can just use the previous method of ffi.new(…).

So, here we have a lonely pointer, all allocated, and ready to use. I’d like to capture a bit of information about the creation event. Back to that existential programming, and back to the HeapBlob data structure.

I could do the following to assign the pointer to one of my HeapBlob data structures:

local blob = ffi.new("HeapBlob", ptr)

In this case, I have a HeapBlob struct, with the Data element initialized to the value the pointer held. So far, nothing has changed over the plain old ordinary pointer, except for the fact that I’ve just introduced a nice memory leak, wrapped up in an object. I need a bit more meat on the bones of the HeapBlob object. I want it to get in on the act of memory management. Specifically, when there are no more references to the object, I want the HeapBlob to deallocate the chunk of memory, thus eliminating the memory leak.

So, I create a metatype to associate some functions with this simple data structure:

HeapBlob = nil
HeapBlob_mt = {
    -- Free the block only if we were given the heap handle and this
    -- thread is the owner; otherwise leave the memory alone.
    __gc = function(self)
        if self.HeapHandle == nil or self.owningthread == 0 then return nil end

        if self.owningthread == kernel32.GetCurrentThreadId() then
            local success = kernel32.HeapFree(self.HeapHandle, 0, self.Data) ~= 0
        end
    end,

    __tostring = function(self)
        return string.format("Blob: Size: %d  Thread: %d",
            self.Size, self.owningthread);
    end,

    __index = {
        GetSize = function(self, flags)
            flags = flags or 0

            if self.HeapHandle == nil then return 0 end

            local result = kernel32.HeapSize(self.HeapHandle, flags, self.Data)

            return result
        end,

        IsValid = function(self, flags)
            flags = flags or 0
            if self.HeapHandle == nil then return false end
            local isValid = kernel32.HeapValidate(self.HeapHandle, flags, self.Data)
            return isValid
        end,

        ReAlloc = function(self, nbytes, flags)
            flags = flags or 0
            -- The struct has no 'Heap' field; the handle lives in HeapHandle.
            if self.HeapHandle == nil then return false end
            local ptr = kernel32.HeapReAlloc(self.HeapHandle, flags, self.Data, nbytes)
            if ptr == nil then return false end
            self.Data = ptr;
            self.Size = nbytes;
            return true
        end,
    },
}
HeapBlob = ffi.metatype("HeapBlob", HeapBlob_mt)

There are a few functions in here, but the __gc function is the one that gets called whenever an instance of the HeapBlob goes out of scope, and has no further references to it. So, in the simplest case, you could do the following:

local blob = HeapBlob(ptr)


And we’re back to having a handle on a piece of memory, and nothing will occur at __gc time, because we have not set owningthread, and our self.HeapHandle is nil. Well, this is actually a good thing. This can be seen in a two thread scenario.

-- Thread 1
local ptr = Heap:Alloc(256)
-- arguments: pointer, size, takeownership, heaphandle
passMemoryPointerToThread(ptr, 256, false, nil)

-- Thread 2
local ptr, size, takeownership, heaphandle = ReceiveMemoryPointerFromThread()
local threadid = 0
if takeownership then threadid = kernel32.GetCurrentThreadId() end
-- HeapBlob has four fields: Data, Size, HeapHandle, owningthread
local blob = HeapBlob(ptr, size, heaphandle, threadid)

In this scenario, thread 1 will retain responsibility for the allocated memory. It communicates this information first, by not giving the handle to the heap that was used to allocate the chunk of memory, and second by giving a flag indicating its desire. Really the withholding of the heaphandle is enough of a hint.

Thread 2 is free to go ahead and use the HeapBlob, pass it amongst its friends, etc. Then, when it’s no longer using the blob, the structure will be garbage collected, and nothing will happen to the bit of memory.

This HeapBlob could have additional information, such as a ref count, which can be incremented any time there is a new reference to it, but then it starts to look like a COM program.

If you want to pass along the HeapBlob, and allow the second thread to take ownership, thread 1 would simply supply the heap handle. In that scenario, thread 2 would have all the information it needs to manage the chunk of memory in ways that it deems appropriate, and it has received an explicit handoff from thread 1 imploring it to do so.
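The handoff rule can be tried out without Windows at all. What follows is a rough simulation in plain Lua, assuming a pretend heap that only tracks which blocks are live, thread ids passed in as plain numbers, and an explicit release() call standing in for __gc. The names newHeap, newBlob and release are invented for this sketch.

```lua
-- Pretend heap: only remembers which block ids are still allocated.
local function newHeap()
    local heap = { live = {}, nextId = 1 }
    function heap:alloc(size)
        local id = self.nextId
        self.nextId = id + 1
        self.live[id] = size
        return id
    end
    function heap:free(id)
        if self.live[id] == nil then return false end
        self.live[id] = nil
        return true
    end
    return heap
end

-- Stand-in for the HeapBlob struct: the same four fields.
local function newBlob(data, size, heaphandle, owningthread)
    return {
        Data = data,
        Size = size,
        HeapHandle = heaphandle,
        owningthread = owningthread or 0,
    }
end

-- Mirrors the __gc rule: free only when the blob carries the heap handle
-- and the releasing thread is the recorded owner.
local function release(blob, currentThread)
    if blob.HeapHandle == nil or blob.owningthread == 0 then return false end
    if blob.owningthread ~= currentThread then return false end
    return blob.HeapHandle:free(blob.Data)
end
```

A blob created without the heap handle can be dropped by anyone and the memory survives. A blob created with the handle and an owner id is freed exactly once, and only by that owner.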

This starts to give me better control of memory, which allows me to more strictly control the mechanisms between collaborating threads.

Getting back to Existence, if I can isolate and define one bit of existence, then I’m improving my ability to write fairly independent programs. I haven’t answered the question “who am I”, and “how am I different from another” as yet. But, at least now I have a way to communicate that I can control. I suspect answering the “who am I” thing will involve some GUID or other mechanisms, including ‘claims’ and possibly certificates. That will be an interesting exploration.

Super Computing Simulation Sure is Simple

A few years ago I had this idea about computing.  I thought, what is the smallest unit of compute power?  Then I thought about trying to define exactly what I meant by this, and how it might be rendered in reality.  Thus was coined the phrase “Computicle”.  Basically, a particle of computation.

A computicle is a basic unit of computation, not defined by NAND gates and transistors, but by pure definition of words and mechanisms.  I have come to believe the essential attributes of a computicle are the following:

  • Ability to receive input
  • Ability to perform some operations
  • Ability to communicate with other computicles

That’s it, and after reading it, it doesn’t sound like it’s worth much of anything. But, it’s amazing how hard it is to clearly define a small group of words precisely enough that they can be used to express a large variety of outcomes. Look at DNA for instance. There are the 4 bases, very simple. And there are rules for which pair with which, again very simple. Given those simple rules, and the soup that life stuff is made of, you can express everything from an ant to a zebra, and everything else in between, including plants.

So, do these computicle attributes buy me anything? Well, I’ll take a step back, and look at why I’d even bother pondering such things in the first place.

Increasingly, putting together a program with any level of complexity is getting harder and harder.  I used to be able to program in assembly, way back when it was 6502, or 8088.  I am completely bewildered by today’s machines, and would much rather rely on a compiler to do it for me.  Putting together a hard core web service is a very challenging task.  There are tons of little pieces, and lots of ways to do it.  Even so, it is at that 6502 assembly stage.  The only problem is, the machine has not been completely defined as yet.  We have bits and pieces such as TCP, JSON, HTTP, storage, and compute.  But, they don’t all flow into a coherent “machine” that can be reasoned about and easily programmed.  As such, there are no ways to really express a web service, such that a compiler could come along and optimize the heck out of it.  We’re all programming in assembly.  Even worse, the assembly is specific to the particular platform and service being deployed.  This is back to the cores and tubes days of early computing.

So, rolling forward, I’d like to use the constraints of the computicles to help me define what it is to create a massively parallel thing.  I’d first like to be able to express the “machine” that I am programming, and then use computicles to express that machine in real terms, and then apply a program to that machine.

Well, that’s kind of lofty stuff.  So, I’d better start much smaller.  I have previously played with running Lua instances in separate threads.  One of the things that I left as an exercise for the reader was the ability to communicate between those threads.  Recently I’ve baked thread support into HeadsUp, and gone the next couple of steps to be able to easily create a thread that runs some Lua script.  And, the next step is showing how to actually talk to such a running thread.

First of all, within HeadsUp, there is some code to run a lua script.  It is in fact the same code that HeadsUp itself uses to run a script.  It’s just exposed, and has a signature such that it can be used as a startup routine for the CreateThread call.  The signature looks like this:

DllExport int RunLuaScript(void *s);

Great. Now, I already had this object, LuaScriptThread. This object takes a script, in the form of a lua string, and creates a thread, and sends the script to RunLuaScript() to be executed. This is the easiest way of running a bit of Lua script in a separate thread from whatever thread you so happen to be calling from.

Here is an instance of doing such a thing:

local ffi = require "ffi";
-- WM_COMMAND and GetLastError are assumed to be declared already by the
-- HeadsUp Win32 bindings, so they are reachable through ffi.C.
local C = ffi.C;

require "LuaScriptThread"

local user32 = ffi.load("user32");
local kernel32 = ffi.load("kernel32");

-- Copy package.path into a C buffer that can be handed to the new thread.
-- ffi.new zero-fills, so the extra byte acts as the terminator.
local path = ffi.new("char["..(string.len(package.path)+1).."]");
ffi.copy(path, ffi.cast("char *",package.path), string.len(package.path));

function testLooper()
    local looper = LuaScriptThread(simplechunk, path);

    -- Give the thread a chance to start

    local maxIterations = 0xffff;
    local counter = 1;
    local bRet;
    local cmd = C.WM_COMMAND;

    while (true) do
        if counter > maxIterations then
            return ;
        end

        bRet = user32.PostThreadMessageA(looper.ThreadId,cmd,counter,0);
        if bRet == 0 then
            local err = C.GetLastError();
            print(string.format("Error: 0x%x", err));
        end
        counter = counter + 1;
    end
end

There are three parts to this piece of code. The first part is simply concerned with getting the right modules loaded, and creating a copy of the current inclusion path, so it can be passed along to the thread when it starts.

The second part of the program is the creation of the thread of execution:

    local looper = LuaScriptThread(simplechunk, path);

As soon as this line is executed, the OS is creating a separate thread to execute whatever Lua script code is indicated by “simplechunk”. The ‘path’ parameter is being passed along as the single argument to the code. Really, this could be anything at all, including a whole large Lua program in and of itself. This thread startup, and the parameter being passed go hand in hand. Usually, whoever creates the threads knows what parameter to pass to the thread on startup. It’s generally a good idea to at least pass the current path, but it completely depends on the code that is to run in the thread.

The last part of this code snippet is concerned with sending messages to the newly created thread. In this case, using PostThreadMessage() will do the trick. On Windows, any thread that starts calling “GetMessage()” will have a message queue associated with it. This is typically done with apps that have windows associated with them, but that’s not strictly required. It used to be that you had to create a fake window to get this queue, but these days, it seems to work without one.

PostThreadMessage() is a non-blocking call. You just call it, and you’re done. Whatever you sent will be pulled out of the queue by the thread routine, which will do whatever it wants with it.

So, how about that thread routine?

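local simplechunk = [[
local ffi = require("ffi");

local path = ffi.string(ffi.cast("char *",tonumber(_ThreadParam)))
package.path = path;

local user32 = ffi.load("User32")

-- Module name assumed; the text only says the class lives in a separate file.
require "MessagePrinter"
printer = MessagePrinter();

local msg = ffi.new("MSG")

local bRet = user32.GetMessageA(msg,nil,0,0);

-- GetMessage returns 0 on WM_QUIT and -1 on error, so stop on either.
while (bRet > 0) do
    printer:Receive(msg);
    bRet = user32.GetMessageA(msg,nil,0,0);
end
]]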

Again, represented by a couple of parts. The first part deals with getting the path, which was passed as a parameter (which shows up as _ThreadParam). That parameter is passed as a string representation of a pointer to the data which represents the argument. It has to be turned back into a number, and then cast to a char *, which can then be converted to a string, and then we can set the package.path. This allows us to then load scripts, just like the original program did.

There is one thing to note about this thread instance of the Lua environment. Unlike the HeadsUp Lua environment, this environment does not have anything but the standard libraries loaded into it, not even the core stuff like class, and the like. So, anything you want to use, you have to “require” or ‘ffi.load’ to get.

But, this is a beachhead. From this stub, you can do anything, including recursively creating even more threads. In fact, this is a good way for a “service” to make instances of stuff. But, I digress.

The second half of the program contains the GetMessageA() loop. In this case, I’m doing something very simple. Just get the message, and print it out. The message printer is implemented in a class in a separate file, which is ‘require(d)’ to get it in:

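require "BanateCore"

-- The class declaration line is missing from the source; this form assumes
-- BanateCore's class helper declares classes this way.
class.MessagePrinter()

function MessagePrinter:_init()
end

function MessagePrinter:Receive(msg)
    -- wParam and lParam are pointer sized, so convert them to plain numbers
    -- before handing them to string.format.
    print(string.format("Message: 0x%x  wParam: 0x%x  lParam: 0x%x",
        msg.message, tonumber(msg.wParam), tonumber(msg.lParam)));
end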

Again, fairly brain dead simple. And here, I start to see the constraints of computicles expressing themselves. The MessagePrinter might be a computicle. It receives data, and does something.

The thread routine is the same, it receives data, and it does something.

One thing that would help here is to standardize the “receives data” part of things. In this example, I’m using two different mechanisms. In one, I’m using PostThreadMessage() to send data to a computicle. In another, I’m calling the “Receive()” method on an object. In both cases, I’m essentially communicating the same information, just using different communications mechanisms.

Well, that’s easily standardizable. Let’s first encapsulate the data in some payload, which could be defined as:

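typedef struct {
    int size;       /* number of bytes of data */
    char data[1];   /* first byte; the rest follows in the same allocation */
} Payload;          /* name is a placeholder, the source does not give one */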

Or something like that. The idea being, whatever data you’ve got, you can pack into a data structure, which tells how big the data is.
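On the Lua side, the same “size, then bytes” idea can be sketched with nothing but strings. This is only an illustration: the 4-byte little-endian length header and the names packPayload and unpackPayload are assumptions, not something HeadsUp defines.

```lua
-- Pack: a 4-byte little-endian length, followed by the raw bytes.
local function packPayload(data)
    local n = #data
    local b0 = n % 256
    local b1 = math.floor(n / 256) % 256
    local b2 = math.floor(n / 65536) % 256
    local b3 = math.floor(n / 16777216) % 256
    return string.char(b0, b1, b2, b3) .. data
end

-- Unpack: returns the data, or nil plus a reason if the buffer is malformed.
local function unpackPayload(buffer)
    if #buffer < 4 then
        return nil, "buffer shorter than header"
    end
    local b0, b1, b2, b3 = string.byte(buffer, 1, 4)
    local n = b0 + b1 * 256 + b2 * 65536 + b3 * 16777216
    if #buffer - 4 < n then
        return nil, "buffer shorter than declared size"
    end
    return string.sub(buffer, 5, 4 + n)
end
```

Since the receiver reads the size first, it can tell a truncated buffer from a complete one, whatever transport carried it.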

After you’ve stuffed the buffer using whatever means available, you’re ready to send the payload off to somewhere. Well, “To Somewhere” could mean many different things. It could be a routine running on a thread, in a process, on an OS, on a machine, in a vault, in the US, or it could be something across the internet. Clearly, some form of addressing needs to occur. The internet has IPv4, but that only gets you halfway there, depending, and that’s the wrong kind of address when I’m communicating with threads within the same process.

Hmmm, what to do… Well, there needs to be an addressing mechanism, so I’ll just wish it into existence. Now, I need a message dispatch mechanism. The message dispatcher will take the address, and the payload, and ensure the payload gets to the address. So, the generalized communications flow looks like:

construct payload
get address of recipient
deliver payload to recipient
[rinse repeat]

Pretty simple in theory, and actually, pretty simple in implementation as well I think.
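As a rough sketch of that flow in plain Lua, assume addresses are just strings and every recipient exposes a Receive(payload) method. The dispatcher and the two toy computicles below are hypothetical names made up for this example.

```lua
local Dispatcher = {}
Dispatcher.__index = Dispatcher

local function newDispatcher()
    return setmetatable({ recipients = {} }, Dispatcher)
end

-- Register anything with a Receive(payload) method under an address.
function Dispatcher:register(address, recipient)
    self.recipients[address] = recipient
end

-- Deliver a payload; returns false if nothing lives at that address.
function Dispatcher:send(address, payload)
    local recipient = self.recipients[address]
    if recipient == nil then return false end
    recipient:Receive(payload)
    return true
end

-- A trivial computicle: receives input, performs an operation, and
-- communicates the result to another address.
local function newUppercaser(dispatcher, forwardTo)
    local obj = {}
    function obj:Receive(payload)
        dispatcher:send(forwardTo, string.upper(payload))
    end
    return obj
end

-- Another one that just remembers everything it receives.
local function newCollector()
    local obj = { received = {} }
    function obj:Receive(payload)
        table.insert(obj.received, payload)
    end
    return obj
end

-- construct payload, get address of recipient, deliver payload
local dispatcher = newDispatcher()
local sink = newCollector()
dispatcher:register("sink", sink)
dispatcher:register("upper", newUppercaser(dispatcher, "sink"))
dispatcher:send("upper", "ping")
```

The sender never touches the recipient directly. Swapping the direct method call inside send() for PostThreadMessage(), or for a socket write, would not change the code on either end.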

Now, an interesting thing begins to happen to my program. Payloads can represent protocols. There’s no reason the protocol has to be connected to the transport mechanism… I can talk TCP/IP between threads in a process, without involving the networking stack in any way. As long as I have a payload packer that knows how to pack TCP/IP packets, and a transport that knows how to talk TCP/IP, it can work.

Hmmm, that’s interesting. I can easily implement “Ping” between two threads. Just as easily, or perhaps easier, than I can do it between internet nodes. Well, isn’t that cheerful! And if I can implement these low level protocols, then I can implement higher level protocols as well, because they simply layer atop these lower ones.

OK. Now to the sweet Jesus moment. If I’ve got these simple computicles, which can communicate using a variety of mechanisms, then I can simulate my distributed cloud based service? Why yes, and why not? It’s a royal pain to try and simulate a distributed service that is based in 8 different countries around the world, without actually deploying it for real. I’d like to be able to simulate the latencies, redundant packets, packet loss, down links, and all the rest. Well, with my little computicles, I might begin to have a chance.
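Here is one way the latency and packet loss part might be sketched in plain Lua, assuming a simulated clock measured in abstract ticks rather than real time. The names newSimulator and newLink are invented here, and the random source is passed in so a run can be made repeatable.

```lua
local function newSimulator()
    local sim = { now = 0, queue = {} }

    -- Schedule fn to run 'delay' ticks after the current simulated time.
    function sim:schedule(delay, fn)
        table.insert(self.queue, { at = self.now + delay, fn = fn })
    end

    -- Run events in timestamp order until the queue is empty.
    -- Events with equal timestamps run in the order they were scheduled.
    function sim:run()
        while #self.queue > 0 do
            local best = 1
            for i = 2, #self.queue do
                if self.queue[i].at < self.queue[best].at then best = i end
            end
            local event = table.remove(self.queue, best)
            self.now = event.at
            event.fn()
        end
    end

    return sim
end

-- A one-way link that delays every payload and drops some of them.
-- 'random' must return a number in [0, 1); it defaults to math.random.
local function newLink(sim, recipient, latency, lossRate, random)
    random = random or math.random
    local link = { sent = 0, dropped = 0 }
    function link:send(payload)
        self.sent = self.sent + 1
        if random() < lossRate then
            self.dropped = self.dropped + 1
            return
        end
        sim:schedule(latency, function() recipient:Receive(payload) end)
    end
    return link
end
```

Anything with a Receive(payload) method can sit at the far end of a link, so a node in another “country” is just a recipient behind a link with a bigger latency and a worse loss rate.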

Similarly, I want to simulate a deployment of a large-scale service. Let’s say 100 nodes, evenly spread across 8 data centers. Then I want to do a live update, and take some nodes down in the process.

Oh, and while I’m at it, of course I want to be able to visualize what the heck is going on without much fuss.

Is this total fantasy? Well, at least I’ve nailed down the running of Lua script in a thread, and communicating with it thing. Now it’s a matter of implementing some more intelligent computicles to string together a grammar which can express some machine configurations. Then I’ll be cooking with gas.

Tech Preview 2012

I’ve always wanted to make a predictions list.  So, here it is for 2012, from the obvious to the fringe…

Good quality FDM printers will hit a commodity price of $400

3D Design software will go through a renaissance, resulting in an explosion of new designs from the common man.  As yet unheard of 3D design software will emerge as a new paradigm.

3D object scanners will break out of the DIY realm, and become as common as flatbed scanners ($300).

Electronics will continue to shrink, and become more powerful.  Rather than this power being concentrated in a few data centers, it will become ubiquitous.  “The network is the computer”.

Designer DNA will allow us to create virus/bacteria to suit our needs.

Protein folding at home will become child’s play, to children.

“Computicles” will emerge as the new meme to describe particles of computation.

A good time will be had by all!