Parsing for mere mortalsPosted: May 21, 2012
Lua has inbuilt pattern matching associated with the string class. With it you can do fairly simple matches, such as this:
for line in alltext:gmatch("([^\r\n]*)") do -- do something with each line end
In this case, it’s saying; ‘return every character that is NOT (‘^’) in the set defined by the whitespace characters \r\n.
For such a small pattern, this is just fine, and it works fast and efficiently. But, what if you have a more complex, or more robust task? Lua’s general philosophy is less is more, and small is beautiful. Sticking to that philosophy, Lua does not contain a general “RegEx” library. Most regex libraries are fairly big, bigger than the Lua runtime itself. Instead, there is a fairly obscure module called lpeg. A Parsing Expression Grammar, is slightly different than what I’ve grown up with in terms of parsing technologies. Aside from all the positives the creator has to say about it, I find it fairly useful to do some parsing tasks. These days, I find parsing binary files to be extremely easy, it’s the parsing of text files that gives me fits. Taking something like HTML for example, oh my gosh what a nightmare. Even the lowly UTC DateTime strings have enough corner cases to build a castle.
Here’s a bit of text file that I’d like to be able to parse:
“v 123.45 23 500”
The letter ‘v’ followed by some amount of space, then followed by three numbers (possibly floating piont). How would I do that in LPEG?
First of all, in LPEG, you form patterns to match parts of the thing you’re trying to parse. These patterns are composable, meaning, you can string them together to form more complex patterns from simpler patterns. Once you have a pattern, you feed the pattern and your text to the lpeg.match() function, and you get a result.
Most of the time, parsing tasks need to be broken down into tokens and statements, if you’re into the old school Lex/Yacc stuff. Various other tools are similar. With LPEG, you do some of the same stuff, but I think it’s a bit easier.
First, I need to define things related to numbers. To define a simple pattern, LPEG has a function, lpeg.P(). This function can take a string as a simple pattern to match. Another function, lpeg.S(), allows you to define a set of characters, and lastly, the function lpeg.R() allows you to define a range. There’s one more thing called a capture, lpeg.C(), which allows you to actually get stuff out of the strings you’re matching. With these in hand, you can do the following:
local locale = lpeg.locale() -- to get some ready made sets local P = lpeg.P local R = lpeg.R local S = lpeg.S local C = lpeg.C local whitespace = locale.space -- S' \t\v\n\f' local OPT_WS = whitespace^0; -- Optional whitespace local HWS = whitespace^1; -- Required whitespace local digit = locale.digit -- R("09"); local hex = locale.xdigit -- R("af","AF","09") local exp = S'eE' * S'+-'^-1 * digit^1 local number_sign = S'+-'^-1 local fs = S'fFlL'^0 local is = S'uUlL'^0 local hexnum = P'0' * S'xX' * hex^1 * is^-1 local octnum = P'0' * digit^1 * is^-1 local decnum = digit^1 * is^-1 local floatnum = digit^1 * exp * fs^-1 + digit^0 * P'.' * digit^1 * exp^-1 * fs^-1 + digit^1 * P'.' * digit^0 * exp^-1 * fs^-1 local numlit = hexnum + octnum + floatnum + decnum local NUMBER = number_sign^-1 * numlit local CNUMBER = C(NUMBER)/tonumber
Woah! That’s an eyefull. Yah, but it breaks down fairly easily. First of all, in Lua, if you are passing a single parameter to a function, like a string, or a table, you can eliminate the surrounding ‘()’. It’s a convenience, although it might make the typing look a little more obscure. So, the two following are equivalent:
P'0' P('0') lpeg.P('0')
Let’s dissect the hexnum pattern:
local hexnum = P'0' * S'xX' * hex^1
This is definig a pattern to match strings like this: “0x23FF”
That’s a fairly familiar pattern to want to match. So, first up is the P’0′. That’s a pattern that says: Match on a literal ‘0’.
Next comes the ‘*’. That means “followed by”. The S’xX’ pattern would match anything in the enclosed set ‘x’ or ‘X’. Then again a ‘*’ meaning followed by.
The next one is a bit odd. What’s with the ‘^1’? In general, the ‘^’ will match on the number following it, in this case, 1 or more. It defines the minimum number of occurences that must exist for the previous pattern. It’s tacked onto the end of that hex pattern, which is defined as any of the hex digits (0..9, a..f, A..F), as defined by the locale.xdigit.
Alright, not to lose track, this pattern; hex^1, which means, at least one of, and possibly more of the hex digits.
So, in English, it all looks like this.
A hexnum IS: A zero followed by one of ‘x’ or ‘X’ followed by one or more of the hex digits (0..9, a..f, A..F).
And that’s about that.
Now, if I have a string like this: “0x23ff and the rest”, I can do the following:
local ending = lpeg.match(hexnum, "0x23ff and the rest") print(ending) >>7
Basically, it will return the offset of the end of the match. And if I want to get the actual number that is being matched, I can use the capture…
local ending = lpeg.match(lpeg.C(hexnum), "0x23ff and the rest") print(ending) >>0x23ff
This will return the numeric literal that was found ‘0x23ff’. If you want this as a number, you can do the following (adding the ‘/tonumber’).
local ending = lpeg.match(lpeg.C(hexnum)/tonumber, "0x23ff and the rest") print(ending) >>9215
This is just scratching the surface, and hardly shows much of the LPEG capabilities. But, it does at least show the flexibility of composition with LPEG patterns. In this particular case, it’s not much better than using ‘sscanf()’ in C. It will look a little more interesting with more complex patterns, particularly when trying to parse something like HTML, or HTTP.
Well, there must be another part to this story…