Essential Performance Tuning

At some point in the software development cycle, I turn my attention to performance tuning.  I’d like to think that I design for performance from the start, and I really do try to, but then there are the nitty-gritty details you have to work out in implementation to ensure you actually achieve the performance you envisioned in your dreams.

Here is a picture of a fill rate test of the renderer.  Basically, I’m trying to answer the question, “What’s the fastest I can render a solid triangle?”.  If you’ve ever done any graphics, particularly 3D, you’ll recognize that having a very fast triangle render routine is kind of essential to getting highly performant graphics.  In this particular case, I created a framebuffer that was 1920×1080, ran at a frame rate of anywhere from 15 to 30, and kept upping the number of triangles drawn per frame until I couldn’t hold the frame rate.  Each triangle is randomly between 50 and 100 pixels on each side, so although small, they’re fairly typical of the sizes found in high-polygon-count 3D rendering.  The amount of time is dominated by triangle setup: I create a DDA (line walker), do vertex sorting, then walk the DDAs, doing memory fills of a solid color along the way.  And of course, I’m not cleaning much up, so I’m generating a ton of garbage along the way, which must all be collected as well.
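The sort-then-walk loop can be sketched like this in plain Lua.  This is just an illustration of the technique, not the actual renderer code; fillTriangle, span, and the fb[y*width + x] indexing convention are my own names and assumptions:

```lua
-- Minimal sketch of a scanline triangle fill using edge DDAs.
-- Assumes integer vertex coordinates and a framebuffer indexed fb[y*width + x].

-- fill one horizontal span: the "memory fill of a solid color"
local function span(fb, width, xa, xb, y, color)
  local xs = math.floor(math.min(xa, xb))
  local xe = math.floor(math.max(xa, xb))
  for x = xs, xe do fb[y * width + x] = color end
end

-- per-scanline x increment along an edge: the DDA (line walker) state
local function slope(xa, ya, xb, yb)
  if yb == ya then return 0 end
  return (xb - xa) / (yb - ya)
end

local function fillTriangle(fb, width, x1, y1, x2, y2, x3, y3, color)
  -- vertex sorting: order by y so (x1,y1) is topmost, (x3,y3) bottommost
  if y2 < y1 then x1, y1, x2, y2 = x2, y2, x1, y1 end
  if y3 < y1 then x1, y1, x3, y3 = x3, y3, x1, y1 end
  if y3 < y2 then x2, y2, x3, y3 = x3, y3, x2, y2 end

  local dx12 = slope(x1, y1, x2, y2)
  local dx13 = slope(x1, y1, x3, y3)
  local dx23 = slope(x2, y2, x3, y3)

  -- walk the top half (edges 1->2 and 1->3), then the bottom half (2->3 and 1->3)
  local xa, xb = x1, x1
  for y = y1, y2 - 1 do
    span(fb, width, xa, xb, y, color)
    xa = xa + dx12; xb = xb + dx13
  end
  xa = x2
  for y = y2, y3 do
    span(fb, width, xa, xb, y, color)
    xa = xa + dx23; xb = xb + dx13
  end
end
```

All the per-triangle work before the first span is written happens in the sorting and slope computations, which is why setup dominates for small triangles.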

With all this, my CPU is sitting at 25–27%, so I’m nowhere near maxing out the whole chip, although when I look at a single core of my quad-core processor, it can be sitting at 75–80%.  Is there opportunity for parallelism?  I could certainly imagine staging things such that the triangle scanning is separate from the polygon filling.  I could create a graphics pipeline similar to what OpenGL does, breaking things into geometry, vertex, and fragment stages.  Even more interesting, I could build such a pipeline by creating separate Lua states to run on separate threads, one for each processor core, then send messages between them using shared memory or something similar.  Not quite the same as the massive number of rendering cores in a GPU, but I could still get some benefit.

Then there’s lower level stuff.  Back to data types: I’ve reconfigured what my data types look like, and the fundamental decision here has more to do with coding style than anything else.  When I look at the vector and matrix classes, how do I want to program them?  Do I want to favor creation of instances, ease of performing simple arithmetic, or interop?  Or can I favor a couple of these at the same time?

I’m opting to favor performance, both with and without interop.  So, let’s look at the simplest case, the lowly vec3.  In OpenGL’s GLSL, a vec3 is essentially float[3], a simple array with 3 float values in it.  GLSL adds some syntactic sugar that allows you to do simple arithmetic with vectors as if they were basic types, so you can write:  v3 = v1 + v2

OK, that’s very convenient.  Is it dramatically different from v3 = add(v1, v2)?

In LuaJIT, I can have the convenience of that syntactic sugar, but it requires me to create a struct to represent the vec3, which means that it’s not quite as easy to pass the thing around when it comes time to do interop.  There will always be some sort of pointer reference involved.  Not a bad tradeoff I guess.
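A sketch of that struct-based approach, using LuaJIT’s ffi.metatype (the vec3_t name and the minimal __add metamethod are my own choices here):

```lua
local ffi = require("ffi")

-- A struct gives us the GLSL-style sugar via metamethods, at the cost of
-- the value being cdata (a pointer reference) rather than a plain array.
ffi.cdef[[
typedef struct { float x, y, z; } vec3_t;
]]

local vec3  -- forward declaration so __add can construct new instances
local mt = {
  __add = function(a, b)
    return vec3(a.x + b.x, a.y + b.y, a.z + b.z)
  end,
}
vec3 = ffi.metatype("vec3_t", mt)

local v1 = vec3(1, 2, 3)
local v2 = vec3(4, 5, 6)
local v3 = v1 + v2   -- the syntactic sugar, just like GLSL
```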

But, here’s another way of doing things.

local ffi = require("ffi")

float = ffi.typeof("float")
floatv = ffi.typeof("float[?]")

With these two simple definitions, I can do the following:

local myfloat = float(3.7)
local myfloatarray = floatv(320*240*4, 3.7)

In the first case, I am allocating a single float value.  I can then go and use that float value almost anywhere as if it were a number.  In some cases, I’ll have to explicitly convert it to a number to pass to a function that expects a Lua number and can’t deal with the fact that this is actually cdata.  But, as long as I stay within the confines of my own little world here, it’s fine.
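For example (assuming LuaJIT; note that 3.7 gets rounded to float precision on the way in, which is why the comparison below is approximate):

```lua
local ffi = require("ffi")
local float = ffi.typeof("float")

local myfloat = float(3.7)
assert(type(myfloat) == "cdata")   -- not a plain Lua "number"

-- when an API insists on a real Lua number, convert explicitly;
-- the value is only approximately 3.7 after the round-trip through float
local n = tonumber(myfloat)
```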

In the second case, I’m allocating an array of floats.  Here, it could represent an array of rgba values (4 floats) in a framebuffer that is 320×240.  Well, that’s pretty precise control of things.
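Indexing into that framebuffer is then just arithmetic.  A sketch, where setPixel and the row-major rgba layout are my own assumptions:

```lua
local ffi = require("ffi")
local floatv = ffi.typeof("float[?]")

local width, height = 320, 240
local fb = floatv(width * height * 4)  -- 4 floats (rgba) per pixel, zero-filled

-- write one pixel; row-major layout, 4 floats per pixel
local function setPixel(x, y, r, g, b, a)
  local offset = (y * width + x) * 4
  fb[offset], fb[offset + 1], fb[offset + 2], fb[offset + 3] = r, g, b, a
end

setPixel(10, 20, 1.0, 0.5, 0.25, 1.0)
```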

Here’s another case where I’m favoring syntax, as well as interop:

vec3 = function(x,y,z) return floatv(3, x,y,z) end
local vertex1 = vec3(10, 20, 30)

That is way convenient!  I had this funny realization that C macros are nothing more than functions in Lua.  Duh!  Even better, they’re not just macros, they’re actually functions.  In this particular case, I want to essentially mimic the vec3 type found in OpenGL GLSL.  So, I’ve already created that convenient floatv type, which will readily create arrays of a particular size for me.  So, I just use that, conveniently creating an array with three elements, and initialize it with whatever the user had passed in.  And there you have it.

Now, when it comes time to do some interop with an OpenGL function, I can simply pass myfloatarray, or vertex1, to a function that expects a pointer to a float array, and that’s the end of that.  At a very low level, passing a single parameter to a function is faster than passing 3 parameters, so passing a pointer to an array of 3 floats is faster than calling a function that requires 3 separate floats, as long as the array is relatively long lived.  There’s not a lot of marshalling going on, and no conversion from Lua number to float.  The thing is garbage collected in time, which is very convenient and saves me from having to do the whole malloc/free dance I would normally have to do in ‘C’.  This is why I say Lua might be the new ‘C’, at least for me.  I get many of the benefits of low level stuff in C (minus easy bit twiddling) without all the hassles and pitfalls of bad memory management.
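What the C side sees can be demonstrated with a cast; here ffi.cast stands in for an actual OpenGL call, which I’m not making in this sketch:

```lua
local ffi = require("ffi")
local floatv = ffi.typeof("float[?]")
local vec3 = function(x, y, z) return floatv(3, x, y, z) end

local vertex1 = vec3(10, 20, 30)

-- A C function taking "const float *" sees the array as a bare pointer:
-- no marshalling, no per-call Lua-number-to-float conversion.
local p = ffi.cast("const float *", vertex1)
assert(p[0] == 10 and p[1] == 20 and p[2] == 30)
```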

So, that’s a bit of performance tuning: choosing the right data structures to meet the needs of the situation, and making the design choices at the fundamental structural representation layer.  My next area of focus is networking code.  I need to get maximum connections per processor, and maximum throughput.  We’ll see how it goes.
