Solid Software Simply Scales

I’ve been doing a lot of network programming of late, and this has forced me to think very small thoughts.  With network programming, the exceptions truly are the rule.  Mostly I’m dealing with non-blocking sockets, so I have to be very attentive to error return values.  This kind of work is well suited to those who really get into the details of such things.  I can do it for a while, but after some time my tired eyes and wandering brain start to move on and begin to miss the corner cases.

What I’ve decided to do is focus on the minutiae early on, when my brain is still fresh and my eyes can still catch the subtle details and corner cases.

Case in point: being able to do a “readline()” is very useful for much network programming.  You see this most in protocols such as HTTP, where a “line” is significant as a delimiter.  Things go off the rails right from the start, though.  The HTTP/1.1 spec states that a line is terminated with a CRLF (\r\n) pair.  Well, things being what they are, in reality lines can end with CRLF, or they can end with a bare LF (\n).

It is very convenient to have a simple ReadLine() routine as part of my stream objects.  The first case I deal with is that of the memory stream.  I want to implement the ReadLine() function.  But, before I can even read a line, I have to be able to read a byte.  Seems simple enough.  But right off the bat again, there’s something to decide.  When you go to read a byte, you could encounter a couple different cases.

  • There is a byte available, so you just return that byte, and no error
  • There are no more bytes available, so you return nil, and ‘eof’ as the ‘error’

Both cases can be seen in this implementation (from LAPHLibs):

function MemoryStream:ReadByte()
  local buffptr = ffi.cast("const uint8_t *", self.Buffer);

  local pos = self.Position
  if pos < self.BytesWritten then
    self.Position = pos + 1
    return buffptr[pos];
  end

  return nil, "eof"
end

To dissect a little bit, the cast at the beginning is needed because the memory stream could be fed by a Lua string, or some other random cdata object. The cast ensures we are in fact dealing with a byte array.

This is a fairly low level function, but right away it establishes a strong meaning which is the foundation for all subsequent functions, such as ReadLine(), or ReadString(), or the Bytes() iterator.

The basic rule is: if there’s a byte, return that byte. If there isn’t, return nil and ‘eof’.
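To see why that contract matters, here is a minimal sketch of a Bytes() iterator built on top of it. This is not the LAPHLibs code: ByteStream is a hypothetical stand-in backed by a plain Lua string, so it runs without LuaJIT’s ffi library. Only the ReadByte() contract is taken from above.

```lua
-- Hypothetical stand-in for the memory stream, backed by a plain Lua string
-- so that it runs without LuaJIT's ffi library.
local ByteStream = {}
ByteStream.__index = ByteStream

function ByteStream.new(str)
  return setmetatable({ Buffer = str, Position = 0, BytesWritten = #str }, ByteStream)
end

-- Same contract as above: a byte, or nil plus "eof".
function ByteStream:ReadByte()
  local pos = self.Position
  if pos < self.BytesWritten then
    self.Position = pos + 1
    return string.byte(self.Buffer, pos + 1) -- Lua strings are 1-based
  end
  return nil, "eof"
end

-- An iterator for the generic 'for' loop. The loop stops as soon as
-- ReadByte() returns nil as its first value.
function ByteStream:Bytes()
  return function()
    return self:ReadByte()
  end
end

local collected = {}
for b in ByteStream.new("A\0B"):Bytes() do
  collected[#collected + 1] = b
end
-- collected now holds 65, 0, 66
```

Note the zero byte in the middle of the test string. Because ‘end of stream’ is signalled with nil rather than 0, a zero byte is just another byte, and the loop carries on past it.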

With this basic function in hand, we can examine the ReadLine() implementation.

-- Read characters from a stream until the specified
-- ending is found, or until the stream runs out of bytes
local CR = string.byte("\r")
local LF = string.byte("\n")

function MemoryStream:ReadLine(maxbytes)
  local readytoberead = self:BytesReadyToBeRead()
  maxbytes = maxbytes or readytoberead

  local maxlen = math.min(maxbytes, readytoberead)
  local buffptr = ffi.cast("const uint8_t *", self.Buffer);

  local nchars = 0;
  local bytesconsumed = 0;
  local startptr = buffptr + self.Position
  local abyte
  local err

  for n=1, maxlen do
    abyte, err = self:ReadByte()
    if not abyte then
      break
    end

    bytesconsumed = bytesconsumed + 1

    if abyte == LF then
      break
    elseif abyte ~= CR then
      nchars = nchars+1
    end
  end

  -- End of File, nothing consumed
  if bytesconsumed == 0 then
    return nil, "eof"
  end

  -- A blank line
  if nchars == 0 then
    return ''
  end

  -- an actual line of data
  return ffi.string(startptr, nchars);
end

To break it down, first of all, you can specify the maximum number of bytes you want to examine looking for a line. This allows you to put some sanity into the program. By default, it will examine the entire length of the memory buffer.

The main loop is pretty straightforward. If a LF (\n) is seen, the loop terminates. If a ‘nil’ is seen, the loop terminates. Otherwise, any character other than CR advances the number of characters that will be copied. There is a ‘bug’ here: if a CR shows up anywhere other than right before the LF, and is followed by valid characters, the result will be wrong, because the CR is skipped in the count but still sits among the bytes being copied. The code is truly expecting CR to be followed by nothing other than LF. This error condition can be easily detected and coded for. You would have to decide whether you’ll allow a single CR to terminate a valid string, or whether you’d return nil and report an error.
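As a rough sketch of the strict option (return nil and report an error), here is one way the check could look. scanLine is a hypothetical helper, not part of LAPHLibs, and it works on a plain Lua string with 1-based indexes rather than on the memory stream. It also treats a CR at the very end of the data as an error, which is an assumption you may not want for partially received buffers.

```lua
local CR = string.byte("\r")
local LF = string.byte("\n")

-- Hypothetical helper: scan one line from the string `s`, starting at the
-- 1-based index `pos`. Returns the line and the index just past its
-- terminator, or nil plus an error string.
local function scanLine(s, pos)
  if pos > #s then
    return nil, "eof"
  end

  local i = pos
  while i <= #s do
    local b = string.byte(s, i)
    if b == LF then
      return string.sub(s, pos, i - 1), i + 1
    elseif b == CR then
      -- Strict policy: a CR is only legal immediately before an LF.
      if string.byte(s, i + 1) ~= LF then
        return nil, "bare CR"
      end
      return string.sub(s, pos, i - 1), i + 2
    end
    i = i + 1
  end

  -- Ran out of bytes without seeing a terminator: return what is left.
  return string.sub(s, pos), #s + 1
end

local first, nextpos = scanLine("ab\r\ncd\n", 1)   -- "ab", then continue at 5
local bad, why = scanLine("a\rb\n", 1)             -- nil, "bare CR"
```

The lenient option would be to replace the “bare CR” branch with a return that treats the lone CR as a terminator in its own right.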

At any rate, moving right along, once the loop terminates, you decide what to return to the caller. It depends on why the loop was terminated.
If the first ‘byte’ read was nil (not to be confused with a numeric value of zero), then ‘bytesconsumed’ will still be zero. In this case, we truly did start at the ‘eof’, and we should just return ‘nil’ and ‘eof’ as the error.

If the termination was due to a blank line, then nchars == 0, meaning the number of ‘valid’ characters read was zero. In this case, we read a ‘blank’ line. So, we want to return '' (the empty string), and no error. This is a very subtle case. You might think to simply return a nil here as well, but nope, that will throw things off. A blank line must be defined as what happens when you read nothing but the line terminator.

The next case is when nchars is greater than zero. In this final case, create a Lua string object, and return that.
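Here is a small sketch of why the blank line case matters so much for something like HTTP, where an empty line marks the end of the headers. The lineReader function is a hypothetical stand-in over a plain Lua string, not the MemoryStream from this post, but it follows the same return conventions: '' for a blank line, nil plus ‘eof’ when the data runs out.

```lua
-- Hypothetical line reader over a Lua string. It accepts LF or CRLF and
-- returns '' for a blank line, or nil plus "eof" when nothing is left.
local function lineReader(s)
  local pos = 1
  return function()
    if pos > #s then
      return nil, "eof"
    end
    -- Plain (non-pattern) search for the next LF; fall back to end of data.
    local stop = string.find(s, "\n", pos, true) or (#s + 1)
    local line = string.sub(s, pos, stop - 1)
    pos = stop + 1
    -- Strip a trailing CR; the parentheses drop gsub's second result.
    return (string.gsub(line, "\r$", ""))
  end
end

local readline = lineReader("Host: example.com\r\nAccept: */*\r\n\r\nbody")

local headers = {}
while true do
  local line = readline()
  if not line then break end     -- the data ended before the headers did
  if line == "" then break end   -- blank line: end of the header block
  headers[#headers + 1] = line
end

local rest = readline()          -- whatever follows the blank line
```

If a blank line came back as nil, the two break conditions would collapse into one, and the caller could no longer tell ‘headers are done’ apart from ‘the connection gave out early’.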

Of course, this is not the most performant thing in the world to do. Ideally you’d return a simple offset telling where in the buffer the terminator was found. Then the caller could decide how they want to deal with memory allocation, if they want to allocate at all. Perhaps they’re fine just keeping the pointer and offset for later use. But, this is the easy man’s way of doing things.
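For the curious, an offset-returning variant might look something like the following. This is only a sketch: findLine is a hypothetical name, it scans a plain Lua string with 1-based indexes instead of a cdata pointer, and it does nothing clever about a stray CR in the middle of a line.

```lua
local CR = string.byte("\r")
local LF = string.byte("\n")

-- Hypothetical helper: locate the next line in `s`, starting at the 1-based
-- index `pos` and examining at most `maxbytes` bytes. Returns the start
-- index, the number of content bytes, and the index where the next scan
-- should begin. It builds no new strings.
local function findLine(s, pos, maxbytes)
  if pos > #s then
    return nil, "eof"
  end
  local last = math.min(#s, pos + (maxbytes or #s) - 1)
  for i = pos, last do
    if string.byte(s, i) == LF then
      local len = i - pos
      -- Drop a CR that sits directly before the LF.
      if len > 0 and string.byte(s, i - 1) == CR then
        len = len - 1
      end
      return pos, len, i + 1
    end
  end
  -- No terminator within range: everything scanned counts as content.
  return pos, last - pos + 1, last + 1
end

local data = "GET / HTTP/1.1\r\nHost: example.com\n"
local start, len, nextpos = findLine(data, 1)
-- The caller builds a string only when it actually needs one.
local requestLine = string.sub(data, start, start + len - 1)
```

The caller keeps (start, len) pairs around for as long as the underlying buffer stays alive, and only pays for a string when one is truly needed.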

With these two basics in hand, the following can occur.

function test_ReadOnly()
local str = [[
This is the first line
And the second and third combined
Followed by the fourth

Fifth
And finally the sixth.
]]

  local mstream = MemoryStream.Open(str, #str, #str)

  repeat
    local line, err = mstream:ReadLine()
    print(string.format("LINE:'%s'", tostring(line)), err);
  until err == "eof"
end

I’m doing a couple of things here. First of all, I’m constructing a memory stream on top of a Lua string. That allows me to have stream semantics on any lua string, which is nice and handy for other purposes.

I’ve created a string literal here. Depending on what editor/platform I’m using, each of those ‘lines’ is terminated with either a ‘LF’ or a ‘CRLF’ combination. This is ok, because the ReadLine() function will deal with either case appropriately.

The repeat loop confidently reads individual lines from the stream, printing them out one by one.

And that’s it. Very foundational stuff, which starts out fairly simply, and builds from there. I’ve noted the cases where the thing will work, where it won’t, how to fix the problems, and the performance concerns, if you care to address them. With these couple of simple functions in hand, I can confidently deal with something like the HTTP protocol, assured that for the most part, I don’t have to worry about whether my simple line parsing is working correctly or not. I can easily understand the code, and the base assumptions. I can add more error checking, if I feel I need that, and I can tune the performance. I won’t change the basic semantics of the ReadByte() routine, because that’s the bedrock upon which the higher level functions are based.

And there you have it.
