I wrote a little while back about the Yeti programming language and the most pleasant time I was having with it, learning and re-learning some functional programming idioms.
One of the first things I set about writing—because I could think of a use for it and it seemed like a nice thing to make—was a parser for the RDF/Turtle language. I’m afraid I couldn’t resist calling my Yeti Turtle module “Yertle”, after the Dr Seuss turtle. You can find the code here on Bitbucket. It is also an in-memory datastore and a Turtle serialiser, but the parser is the complicated bit.
How I got on
I don’t recall ever having written a non-trivial parser before, so this was an exercise in learning how to do that, as well as practice in Yeti. One reason I thought a Turtle parser would be a good exercise is that it’s a simple, human-readable language that I (thought I) already knew.
In fact it’s not as simple as all that: the Turtle grammar is very straightforward, but lexically it’s disappointingly fiddly. And while I had thought it could be processed with only single-character lookahead and no backtracking, that turns out not to be the case.
The result is a recursive-descent parser without a preceding tokenizer, in which the parse functions return either partially constructed triples (complete triples being emitted directly into a state object) or an error variant.
Unwinding errors through parse function return values rather than, say, exceptions seems nice and tidy, but it does lead to a lot of nested decisions:
```
aOrPrefixedName state =
    case pnameCandidate state of
    OK tok:
        case tok of
        Token "a": OK rdfTypeIRI;
        Token t:   prefixExpanded state t;
        esac;
    Error e: Error e;
    esac;
```
However, with this structure we can add a composition operator to apply a chain of parse functions, dropping out when we get one that returns an error:
```
(~>) a b =
    case a of
    OK result: b result;
    Error e:   Error e;
    esac;
```
making our example
```
aOrPrefixedName state =
    pnameCandidate state ~> \case of
        Token "a": OK rdfTypeIRI;
        Token t:   prefixExpanded state t;
        esac;
```
in which the second part of aOrPrefixedName (the case lambda expression) is evaluated only if the first part (pnameCandidate) succeeds and returns an OK (Token t) variant; otherwise it is short-circuited and the error returned immediately.
This seems to be a pretty basic construct in monadic parsers and the like—being clueless about parsers and even more clueless about monads, I was unaware of this, but this kind of thing drops out by itself when looking at how to simplify a potentially repetitive bit of code.
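For what it’s worth, the same short-circuiting composition can be sketched in a few lines of Python (the function names below are hypothetical stand-ins, not Yertle’s actual code): an OK/Error result type, plus a then function playing the role of ~>.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union

@dataclass
class OK:
    value: Any

@dataclass
class Error:
    message: str

Result = Union[OK, Error]

# The role of (~>): apply the next parse step only on success,
# otherwise pass the Error through untouched.
def then(r: Result, f: Callable[[Any], Result]) -> Result:
    return f(r.value) if isinstance(r, OK) else r

# Hypothetical stand-ins for pnameCandidate / prefixExpanded:
def pname_candidate(text: str) -> Result:
    return OK(text) if text else Error("expected a prefixed name")

def a_or_prefixed_name(text: str) -> Result:
    return then(pname_candidate(text),
                lambda tok: OK("rdf:type") if tok == "a"
                            else OK("expanded:" + tok))
```

Chaining several then calls gives the same drop-out-on-first-error behaviour as a sequence of ~> applications; in monadic terms it is simply bind for an error monad.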
How well does it work?
The parser passes all the W3C spec tests, plus a few additional unit tests.
Yertle has about half the number of lines of code of David Robillard’s Serd, the cleanest C implementation I know. But Yertle uses quite a number of regular expressions, and I’m not sure I would argue that it’s more readable than Serd. On the other hand, it would certainly be even less readable if I had written it in C.
Yertle is fast enough to handle modest documents in many practical situations, but it isn’t all that fast. It currently takes about four times as long to parse and re-serialise a large Turtle file as rapper does and around ten times as long as the Serd serdi conversion utility. The times aren’t quite comparable because rapper and serdi don’t load to an indexed store on the way through (in fact serdi can convert arbitrarily large documents, which neither rapper nor Yertle can). And neither of those utilities, in the versions I compared with, quite implements the current W3C Turtle spec. But I don’t imagine those things make a huge difference.
(I decided beforehand I’d be happy to be within a factor of 5 of current C parser/serialisers, by which I really meant rapper. But as it turns out, now it’s written, I’d like to be a bit closer than that! I may have a bit of work to do, if only for my own satisfaction.)
The current SVN version of Serd does implement the latest draft (except the SPARQL PREFIX/BASE garbage), but I’m waiting until it’s final to release. The performance impact shouldn’t be significant, but I haven’t actually tested it yet.
The old version was implementable with a single-character peek/eat sort of interface. The new draft (mainly the rules about dots) made implementing a parser dramatically more difficult, which is really unfortunate.