Messing around with UTF-8 in LuaJIT

So, the world has long since left ASCII behind… Well, not entirely: there are still plenty of systems on the net that require “7-bit clean” data. Most programs are at least 8-bit clean, and the real progressives, for the past 10 years or so, have been at least aware of utf-8.

A couple weeks back, I read this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Having been in software for the past couple of decades, I can vividly remember the progression from ASCII through code pages, and on to Unicode and its myriad encodings.  These days, I stick to utf-8 as the lingua franca of the internet, as the various other encodings are rarer, and just a pain.  But, alas, my beloved Lua does not really have utf-8 support, which I think is a crying shame for a modern language.  The Lua community has written a bit about unicode support, and it’s not a satisfying story.  Yes, you can in fact store unicode data in a Lua “string”, as the strings themselves are “8-bit clean”: since the length is held separately from the data, a string can hold the full range of byte values.

Great, so there is in fact storage.  But, there are no support routines built into the language.

What to do, what to do?  Unicode has these things called code points, which are essentially numeric values for the thousands of characters found in the hundreds of scripts of the languages around the world.  They don’t fit within the 128 values (0–127) that represent ASCII characters, so here we are.  The challenge with unicode is how to represent it in a stream of bytes.  The easiest way would be to simply use 4 bytes per character… and that’s in fact what the utf-32 encoding does.  But, this is rather unsatisfying from a programmer’s perspective, as it takes up a lot of space, and just feels icky (official precise programming terminology).  Of course, these days, where we have infinite bandwidth and compute power, who cares if you waste a few bytes here and there?  Just compress it and be on your way.

Well, utf-8 is a kind of compression, it favors the scripts where most of the characters can be represented with a single byte, or a couple of bytes, and makes the scripts that need more bytes pay the price.  The wikipedia article does the subject justice.
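To make that trade-off concrete, here is a small illustration of my own (the function name is hypothetical, not part of the decoder below) showing how many bytes utf-8 spends on a given code point:

```lua
-- How many bytes utf-8 uses to encode a given code point.
-- (Illustrative only; the ranges come from the utf-8 specification.)
function utf8_encoded_length(codepoint)
  if codepoint < 0x80 then
    return 1      -- ASCII rides along at one byte
  elseif codepoint < 0x800 then
    return 2      -- Latin supplements, Greek, Cyrillic, ...
  elseif codepoint < 0x10000 then
    return 3      -- most CJK characters
  else
    return 4      -- emoji, historic scripts, and the rest
  end
end

print(utf8_encoded_length(0x41))    -- 'A'  -> 1
print(utf8_encoded_length(0x4E2D))  -- '中' -> 3
```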

So, what about Lua?

Ultimately, I’d like to do something like have an encoding attached to a stream, so that I can pull utf-8 encoded strings out of the stream.  Let me start at the beginning then.  Although the basics of utf-8 are fairly… basic, it does take some clever coding to actually do things right, deal with error cases, and be highly performant.  I found one good read on a C99 implementation of a utf-8 decoder.  Of course, my philosophy is all Lua all the time, so I created a pure LuaJIT version of the routine suggested there, and here it is:

local ffi = require("ffi")
local bit = require("bit")
local band, bor, lshift, rshift = bit.band, bit.bor, bit.lshift, bit.rshift

local UTF8_ACCEPT = 0
local UTF8_REJECT = 12

local utf8d = ffi.new("const uint8_t[364]", {
  -- The first part of the table maps bytes to character classes,
  -- which reduces the size of the transition table and creates bitmasks.
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
   8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,

  -- The second part is a transition table that maps a combination
  -- of a state of the automaton and a character class to a state.
   0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
  12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
  12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
  12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
  12,36,12,12,12,12,12,12,12,12,12,12,
});

function decode_utf8_byte(state, codep, byte)
  local ctype = utf8d[byte];
  if (state ~= UTF8_ACCEPT) then
    codep = bor(band(byte, 0x3f), lshift(codep, 6))
  else
    codep = band(rshift(0xff, ctype), byte);
  end
  state = utf8d[256 + state + ctype];
  return state, codep;
end

This is essentially a state machine. On the first call, you pass in the starting state (UTF8_ACCEPT), the initial codepoint (0), and the first byte of the string. It returns the next state, as well as the codepoint accumulated thus far. If the state returned is again UTF8_ACCEPT, then codep now contains a complete unicode code point. If the input is invalid utf-8, the state returned will be UTF8_REJECT.
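To make the calling convention concrete, here is what decoding the two-byte sequence for “é” (U+00E9, encoded as the bytes 0xC3 0xA9) looks like by hand, using the function above:

```lua
-- decode "é" (U+00E9), which utf-8 encodes as the bytes 0xC3, 0xA9
state, codep = decode_utf8_byte(UTF8_ACCEPT, 0, 0xC3)
-- state is now a continuation state (24); codep holds a partial value
state, codep = decode_utf8_byte(state, codep, 0xA9)
assert(state == UTF8_ACCEPT and codep == 0xE9)
```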

Simple enough. With this bit of state machine, the whole world can be turned on its ear… or at least utf8 strings can be decoded.

Now that basic byte sequences can be turned into UNICODE codepoint values, the next thing to do is think carefully about the many places in which we want this to occur, and what’s the best way to go about using this newfound power.

The first thing I want is an iterator which will run over a set of bytes and return code points:

--[[
Given a UTF-8 string, this routine will feed
out UNICODE code points as an iterator.

Usage:

for codepoint, err in utf8_string_iterator(utf8string) do
  print(codepoint)
end
--]]

function utf8_string_iterator(utf8string, len)
  len = len or #utf8string
  local state = UTF8_ACCEPT
  local codep = 0
  local offset = 0
  local ptr = ffi.cast("const uint8_t *", utf8string)
  local bufflen = len

  return function()
    while offset < bufflen do
      state, codep = decode_utf8_byte(state, codep, ptr[offset])
      offset = offset + 1
      if state == UTF8_ACCEPT then
        return codep
      elseif state == UTF8_REJECT then
        return nil, state
      end
    end
    return nil, state
  end
end

Iterators are a wonderful thing. They allow you to view the world in easy terms, and their lazy evaluation makes for fairly low resource utilization in many cases. In this case, I now have an iterator that can be used to return code points for any given utf-8 source. It can be fed either a straight-up Lua string or a buffer of bytes; in either case, the source is cast to a “uint8_t *”, which makes for fast, easy traversal without having to pull out one byte at a time with Lua’s string functions. So great, now I have this iterator. What to do with it?

The way I look at it, I now have a tool that allows me to pass it a stream of bytes, and it will hand me back code points as I need them. I could do anything from string compares, to string length count, to encoding conversions. The critical part is done, and now I’m dealing with UNICODE codepoint values.
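For instance, feeding the iterator a string that mixes ASCII with a three-byte CJK character (written here with explicit byte escapes, so the source file’s own encoding doesn’t matter):

```lua
-- "A中B": the middle character is U+4E2D, encoded as 0xE4 0xB8 0xAD
points = {}
for codepoint in utf8_string_iterator("A\228\184\173B") do
  points[#points + 1] = codepoint
end
-- points now holds { 65, 0x4E2D, 66 }
```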

Here is the implementation of a simple “strlen” for utf-8:

function utf8_string_length(utf8string, len)
  local count = 0
  for codepoint, err in utf8_string_iterator(utf8string, len) do
    count = count + 1
  end
  return count
end

local utf8string = "Here is the string to be tested";
print("Length: ", utf8_string_length(utf8string));

>> 31

Yes, the string here is straight ASCII, so maybe it’s not proving much. But more interesting strings, with actual multi-byte utf-8 characters embedded, will work just as well.
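For example, “naïve” spelled with the two-byte character U+00EF (again using byte escapes) shows the byte count and the code point count coming apart:

```lua
-- "naïve": six bytes on the wire, five code points
naive = "na\195\175ve"
print(#naive)                     -- 6 (Lua's # counts bytes)
print(utf8_string_length(naive))  -- 5 (code points)
```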

A function that checks the validity of a utf-8 encoded string could also be implemented fairly easily using the same technique that the iterator does. Just loop through, decoding, until you get a UTF8_REJECT as the state.
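A sketch of such a validity check, built on the same decode_utf8_byte routine (the function name here is mine):

```lua
-- Returns true if the buffer holds well-formed utf-8.
-- A truncated trailing sequence leaves the machine mid-state,
-- so we also require ending on UTF8_ACCEPT.
function utf8_string_is_valid(utf8string, len)
  len = len or #utf8string
  local ptr = ffi.cast("const uint8_t *", utf8string)
  local state, codep = UTF8_ACCEPT, 0
  for i = 0, len - 1 do
    state, codep = decode_utf8_byte(state, codep, ptr[i])
    if state == UTF8_REJECT then
      return false
    end
  end
  return state == UTF8_ACCEPT
end
```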

And lastly, since stream objects exist, it’s fairly easy to attach a utf-8 decoder to a stream. Then, you just start reading from the stream, and whenever you call ReadString(), you can be assured that what you’ll get is a string, that is UNICODE. But, that UNICODE string object doesn’t quite exist yet. That would be an excellent exercise for the reader.
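I won’t build the string object here, but a minimal sketch of the decoder-on-a-stream idea might look like the following. Note that the stream interface (a ReadByte() method returning the next byte, or nil at end of stream) is an assumption of mine, not the actual stream API:

```lua
-- Wrap a byte stream so that each call to the returned function
-- yields the next unicode code point, nil at end of stream,
-- or nil plus an error message on invalid utf-8.
function utf8_codepoint_reader(stream)
  return function()
    local state, codep = UTF8_ACCEPT, 0
    repeat
      local byte = stream:ReadByte()   -- assumed stream interface
      if byte == nil then
        return nil                     -- end of stream
      end
      state, codep = decode_utf8_byte(state, codep, byte)
      if state == UTF8_REJECT then
        return nil, "invalid utf-8"
      end
    until state == UTF8_ACCEPT
    return codep
  end
end
```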


2 Comments on “Messing around with UTF-8 in LuaJIT”

  1. Eric says:

    Thanks for a very cool post. I’m actually working on a comprehensive open source Unicode library for Lua and would love to include your code as an optional optimization for users with LuaJIT. Is it under the same license as the C code that inspired it (MIT, I believe)? The way I’m setting up the library is that it would use the C library or LuaJIT routines if either is available, else fall back on the vanilla Lua I’ve written.

    I tried to find an e-mail to write and ask you for permission to use it (with appropriate credit, of course), and to find out if/how the code is licensed, but I couldn’t find one.

    Thanks again for a very well-done and timely post.
