Performance, structures matter…
Posted: March 30, 2012
Performance tuning is always fun. The very best programmers in the world know how each and every choice they make will play out on their machine architecture, runtime, compiler, etc. I guess I’m not one of those types of programmers, because I’ve long since lost track of how what I code affects the lowest-level microcode. But I occasionally make crude attempts to speed things up at a much higher level.
One area I focused on recently was my basic frame buffer. I had originally coded this as a single array of bytes, or whatever data structure was specified. I had this convenient FixedArray2D class which made allocating them easy, by wrapping up some ffi goop to get at native data structures. The downside of this class was that all access went through “SetElement” and “GetElement”. Each of those calls did a calculation to determine where in the vast byte array the particular element was. OK, that’s a no-brainer. Mostly that calculation is going to be very quick, and maybe even inlined, etc. But how about just doing away with it and letting the compiler deal with it, just like in C?
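To make the cost concrete, here is a rough sketch of that older style. It is not the original FixedArray2D code: the FlatArray2D name, the argument order, and the row-major layout are all assumptions made for illustration. It only needs LuaJIT's ffi library.

```lua
local ffi = require("ffi")

-- Hypothetical stand-in for the old approach: one flat buffer plus accessors
-- that compute the element offset on every call.
local FlatArray2D = {}
FlatArray2D.__index = FlatArray2D

function FlatArray2D.new(columns, rows, kind)
    local self = setmetatable({}, FlatArray2D)
    self.columns = columns
    self.rows = rows
    -- "kind[?]" is LuaJIT's variable-length array syntax; memory is zero-filled
    self.data = ffi.new(kind .. "[?]", columns * rows)
    return self
end

-- Row-major offset (assumed): skip `row` full rows, then step `column` elements
function FlatArray2D:offset(column, row)
    return row * self.columns + column
end

function FlatArray2D:SetElement(column, row, value)
    self.data[self:offset(column, row)] = value
end

function FlatArray2D:GetElement(column, row)
    return self.data[self:offset(column, row)]
end

local fb = FlatArray2D.new(4, 3, "uint8_t")
fb:SetElement(2, 1, 255)
print(fb:GetElement(2, 1))   -- the value just stored
print(fb.data[1 * 4 + 2])    -- the same element through the raw offset
```

Every read or write here pays for a method call plus a multiply and an add in Lua code, which is exactly the overhead the rest of this post is about removing.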
So, I ditched it. Now, I have some other convenience functions that do the job much more nicely and succinctly.
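local ffi = require("ffi")

-- Builds a C declaration such as "pixel_BGRA_b[480][640]" and allocates it
Array2D = function(columns, rows, kind)
    return ffi.new(string.format("%s[%d][%d]", kind, rows, columns))
end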
Which can be used thus:
local window = Array2D(captureWidth, captureHeight, "pixel_BGRA_b")
local graphPort = Array2DRenderer.Create(captureWidth, captureHeight, window, "pixel_BGRA_b")
In this case, I’m creating a pixel buffer of a certain size, filled with a BGRA pixel type. Then I construct a renderer on top of that, and move on with life. This works out quite nicely and gives you easy access to the data using normal two-dimensional matrix access:
window[row][column] = value
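Here is a small self-contained sketch of that access pattern. The definition of pixel_BGRA_b is not shown in this post, so the struct below (demo_pixel_bgra, with its field names) is an assumed layout used purely for illustration, and the buffer dimensions are arbitrary.

```lua
local ffi = require("ffi")

-- Assumed pixel layout for illustration only; not the real pixel_BGRA_b
ffi.cdef[[
typedef struct { uint8_t Blue, Green, Red, Alpha; } demo_pixel_bgra;
]]

local function Array2D(columns, rows, kind)
    return ffi.new(string.format("%s[%d][%d]", kind, rows, columns))
end

local width, height = 8, 4
local window = Array2D(width, height, "demo_pixel_bgra")

-- Index by row first, then column; indices are zero-based, as in C
window[2][5].Blue  = 255
window[2][5].Green = 128
window[2][5].Alpha = 255

local p = window[2][5]
print(p.Blue, p.Green, p.Red, p.Alpha)

-- The whole buffer is one contiguous block: rows * columns * 4 bytes
print(ffi.sizeof(window))
```

Note that there is no bounds checking on these native arrays, so an out-of-range row or column reads or writes whatever memory happens to be there.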
Not only is this much more convenient, it turns out to be much faster as well. I guess the runtime is taking care of the appropriate calculations, inlining, and turning it all into machine code. No more function calls in the way or anything. Just plain fast code. I’ve even changed the renderer to use regular array assignments. For example, LineH (draw a horizontal line) used to be essentially a memcpy. Now I figure that iterating over the locations and making assignments might be just as fast as the memcpy, so I just do that and let the compiler figure out how to optimize it, as this should be easily optimizable.
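A horizontal line written as plain assignments might look something like the following. This is a sketch rather than the renderer's actual code: the LineH signature, the uint32_t pixel type, and the buffer size are assumptions chosen to keep the example short.

```lua
local ffi = require("ffi")

local width, height = 16, 4
-- One 32-bit value per pixel (assumed), zero-filled on allocation
local window = ffi.new(string.format("uint32_t[%d][%d]", height, width))

-- Hypothetical LineH: set `length` pixels on row `y`, starting at column `x`
local function LineH(buffer, x, y, length, value)
    local row = buffer[y]              -- hoist the row lookup out of the loop
    for column = x, x + length - 1 do
        row[column] = value
    end
end

LineH(window, 2, 1, 5, 0xFF00FF00)

print(window[1][2] == 0xFF00FF00)  -- first pixel of the line
print(window[1][6] == 0xFF00FF00)  -- last pixel of the line
print(window[1][7] == 0)           -- just past the end, untouched
```

A loop this simple is the kind of code a tracing JIT tends to handle well, though whether it actually matches a memcpy is something to measure rather than assume.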
In the end, what did this get me? Well, the graphics fill rate has gone up. I can now draw thousands of tiny little triangles in realtime (30 fps), without much sweat to the system. That same task was getting bogged down at around 2048 triangles using my previous structures. So, this is an improvement.
Now I’m casting an eye towards matrix and vector speedups. At the moment, I represent my matrix class as a 16-element array. This is convenient for OpenGL, but it’s pretty inconvenient for virtually everything else. Again, I have to litter the code with offsets and the like, and probably hurt my performance. I’ll just switch this to being an Array2D of “double” and see where that gets me. I could just use Lua tables, but I’m not sure if there’s a higher cost to Lua tables doing the appropriate lookups, or if there’s a higher cost converting types between Lua numbers and the native types I’m storing in my structures (typically float). We’ll see. Since multiplying a vec3 by a mat4 is the hot path in graphics processing, making this path as fast as possible is a very desirable thing.
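As a rough idea of what that switch could look like, here is a sketch of a 4x4 matrix stored as a native double[4][4] and applied to a point. The function names and the convention (the point treated as a column vector with w = 1, translation in the last column) are assumptions for illustration, not the existing matrix class.

```lua
local ffi = require("ffi")

-- 4x4 matrix of doubles, zero-filled on allocation
local function Mat4()
    return ffi.new("double[4][4]")
end

local function Identity()
    local m = Mat4()
    for i = 0, 3 do
        m[i][i] = 1
    end
    return m
end

-- Transform the point (x, y, z, 1) as a column vector: result = m * v.
-- Only the first three rows are needed when w is not used afterwards.
local function TransformPoint(m, x, y, z)
    local rx = m[0][0] * x + m[0][1] * y + m[0][2] * z + m[0][3]
    local ry = m[1][0] * x + m[1][1] * y + m[1][2] * z + m[1][3]
    local rz = m[2][0] * x + m[2][1] * y + m[2][2] * z + m[2][3]
    return rx, ry, rz
end

local m = Identity()
m[0][3], m[1][3], m[2][3] = 10, 20, 30   -- translation in the last column

local x, y, z = TransformPoint(m, 1, 2, 3)
print(x, y, z)
```

The storage is still 16 contiguous values, so the [row][column] form changes how the code reads rather than how the data sits in memory.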
In the meantime, fill rates have gone up. This bodes well for other data structures, like the ones needed to compress bits of video screen before sending them off to the network. Fast access at a very low level will be highly beneficial there.