Yertle: an RDF/Turtle library in Yeti

I wrote a little while back about the Yeti programming language and the very pleasant time I was having with it, learning and re-learning some functional programming idioms.

One of the first things I set about writing—because I could think of a use for it and it seemed like a nice thing to make—was a parser for the RDF/Turtle language. I’m afraid I couldn’t resist calling my Yeti Turtle module “Yertle”, after the Dr Seuss turtle. You can find the code here on Bitbucket. Yertle also includes an in-memory datastore and a Turtle serialiser, but the parser is the complicated bit.

How I got on

I don’t recall ever having written a non-trivial parser before, so this was an exercise in learning how to do that, as well as practice in Yeti. One reason I thought a Turtle parser would be a good exercise is that it’s a simple, human-readable language that I (thought I) already knew.

In fact it’s not as simple as all that: the Turtle grammar is very straightforward, but it’s disappointingly fiddly at the lexical level. And while I had thought it could be processed with only single-character lookahead and no backtracking, that turns out not to be the case.

The result is a recursive-descent parser without a preceding tokenizer, in which the parse functions return either partially constructed triples (complete triples being emitted directly into a state object) or an error variant.

Unwinding errors through parse function return values rather than, say, exceptions seems nice and tidy, but it does lead to a lot of nested decisions:

aOrPrefixedName state =
    case pnameCandidate state of
    OK tok:
        case tok of
        Token "a": OK rdfTypeIRI;
        Token t: prefixExpanded state t;
        esac;
    Error e:
        Error e;
    esac;

However, with this structure we can add a composition operator to apply a chain of parse functions, dropping out when we get one that returns an error:

(~>) a b =
    case a of
    OK result: b result;
    Error e: Error e;
    esac;

making our example

aOrPrefixedName state =
    pnameCandidate state ~>
       \case of
        Token "a": OK rdfTypeIRI;
        Token t: prefixExpanded state t;
        esac;

in which the second part of aOrPrefixedName (the case lambda expression) is evaluated only if the first part (pnameCandidate) succeeds and returns an OK (Token t) variant; otherwise the second part is skipped and the error is returned immediately.

This seems to be a pretty basic construct in monadic parsers and the like—being clueless about parsers and even more clueless about monads, I was unaware of this, but this kind of thing drops out by itself when looking at how to simplify a potentially repetitive bit of code.
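For comparison, here is the same short-circuiting idea sketched in C++ (the Result type, the andThen helper and the two parse steps are all invented for illustration; none of this is Yertle code):

#include <iostream>
#include <string>
#include <variant>

struct Error { std::string message; };

template <typename T>
using Result = std::variant<T, Error>;

// Apply the next step only if the previous one succeeded; otherwise
// pass the error straight through, which is all that (~>) does.
template <typename T, typename F>
auto andThen(const Result<T> &r, F next) -> decltype(next(std::get<T>(r))) {
    if (auto e = std::get_if<Error>(&r)) return Error{e->message};
    return next(std::get<T>(r));
}

// Stand-ins for pnameCandidate and the case lambda in the Yeti version
Result<std::string> pnameCandidate(const std::string &input) {
    if (input.empty()) return Error{"unexpected end of input"};
    return input;
}

Result<std::string> expandToken(const std::string &token) {
    if (token == "a") return std::string("the rdf:type IRI");
    return "expanded form of " + token;
}

int main() {
    // expandToken runs only because pnameCandidate returned a value
    auto r = andThen(pnameCandidate("a"), expandToken);
    if (auto v = std::get_if<std::string>(&r)) std::cout << *v << "\n";
    else std::cout << "error: " << std::get<Error>(r).message << "\n";
}

The chain reads in the same order as the Yeti version, and the error case never has to be spelled out at each step.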

How well does it work?

The parser passes all the W3C spec tests, plus a few additional unit tests.

Yertle has about half the number of lines of code of David Robillard’s Serd, the cleanest C implementation I know. But Yertle uses quite a number of regular expressions, and I’m not sure I would argue that it’s more readable than Serd. On the other hand, it would certainly be even less readable if I had written it in C.

Yertle is fast enough to handle modest documents in many practical situations, but it isn’t all that fast. It currently takes about four times as long to parse and re-serialise a large Turtle file as rapper does and around ten times as long as the Serd serdi conversion utility. The times aren’t quite comparable because rapper and serdi don’t load to an indexed store on the way through (in fact serdi can convert arbitrarily large documents, which neither rapper nor Yertle can). And neither of those utilities, in the versions I compared with, quite implements the current W3C Turtle spec. But I don’t imagine those things make a huge difference.

(I decided beforehand I’d be happy to be within a factor of 5 of current C parser/serialisers, by which I really meant rapper. But as it turns out, now it’s written, I’d like to be a bit closer than that! I may have a bit of work to do, if only for my own satisfaction.)

Small conclusions about APIs and testing

In my previous post I explained a small but significant API change for v0.9 of the Dataquay library.

Although there was nothing very deep about this change or its causes, I found it interesting partly because I had used a partly test-driven process to evolve the original API and I felt there may be a connection between the process and any resulting problems. Here are a few thoughts prompted by this change.

Passing the tests is not enough

Test-driven development is a satisfying and welcome prop. It allows you to reframe difficult questions of algorithm design in terms of easier questions about what an algorithm should produce.

But producing the right results in every test case you can think of is not enough. It’s possible to exercise almost the whole of your implementation in terms of static coverage, yet still have the wrong API.

In other words, it may be just as easy to overfit the API to the test cases as it is to overfit the test cases to the implementation.

Unit testing may be easier than API design

So, designing a good API is harder than writing tests for it. But to rephrase that more encouragingly: writing tests is easier than designing the API.

If, like me, you’re used to thinking of unit testing as requiring more effort than “just bunging together an API”, this should be a worthwhile corrective in both directions.

API design is harder than you think, but unit testing is easier. Having unit tests doesn’t make it any harder to change the API, either: maintaining tests during redesign is seldom difficult, and having tests helps to ensure the logic doesn’t get broken.

Types are not just annoying artifacts of the programming language

An unfortunate consequence of having worked with data representation systems like RDF mostly in the context of Web backends and scripting languages is a tendency to treat everything as “just a string”.

This is fine if your string has enough syntax to be able to distinguish types properly by parsing it—for example, if you represent RDF using Turtle and query it using SPARQL.

But if you break down your data model into individual node components while continuing to represent those as untyped strings, you’re going to be in trouble. You can’t get away without understanding, and somewhere making explicit, the underlying type model.
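As a minimal sketch of what making the type model explicit means (an illustration only, not Dataquay’s actual Node class):

#include <string>

// Illustrative only: the node's type travels with its value,
// instead of being guessed later from the shape of the string
struct Node {
    enum Type { URI, Literal, Blank };
    Type type;
    std::string value;
};

// The same characters, but two different things:
Node asUri     { Node::URI,     "http://example.com/profession/Builder" };
Node asLiteral { Node::Literal, "http://example.com/profession/Builder" };

Code receiving a Node can then ask which of the two it has been given, rather than guessing.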

Predictability contributes to simplicity

A simpler API is not necessarily one that leads to fewer or shorter lines of code. It’s one that leads to less confusion and more certainty, and carrying around type information helps, just as precondition testing and fail-fast principles can.
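A tiny example of the precondition point, with a made-up function and a deliberately crude test, just to show the shape of it:

#include <stdexcept>
#include <string>

// Hypothetical: refuse to guess about a possibly-relative URI; insist the
// caller resolves it first, and fail immediately if they have not
void setGraphUri(const std::string &uri) {
    if (uri.find("://") == std::string::npos) {   // crude absoluteness check
        throw std::invalid_argument("setGraphUri: expected an absolute URI, got: " + uri);
    }
    // ... from here on the precondition has been checked, not assumed
}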

It’s probably still wrong

I’ve effectively found and fixed a bug, one that happened to be in the API rather than the implementation. But there are probably still many remaining. I need a broader population of software using the library before I can be really confident that the API works.

Of course it’s not unusual to see significant API holes in 1.0 releases of a library, and to get them tightened up for 2.0. It’s not the end of the world. But it ought to be easier and cheaper to fix these things earlier rather than later.

Now, I wonder what else is wrong…

Details of the Dataquay v0.9 API changes

Dataquay hasn’t seen a great deal of use yet. I’ve used it in a handful of personal projects that follow the same sort of model as the application it was first designed for, and that’s all.

But I’ve recently started to adapt it to a couple of programs whose RDF usage follows more traditional Linked Data usage patterns (Sonic Visualiser and Sonic Annotator), as a replacement for their fragile ad-hoc RDF C++ interfaces. And it became clear that in these contexts, the API wasn’t really working.

I can get away with changing the API now as Dataquay is still a lightly-used pre-1.0 library. But, for the benefit of the one or two other people out there who might be using it—what has changed, and why?

The main change

The rules for constructing Dataquay Uri, Node and Triple objects have been simplified, at the expense of making some common cases a little longer.

If you want to pass an RDF URI to a Dataquay function, you must now always use a Uri object, rather than a plain Qt string. And to create a Uri object you must have an absolute URI to pass to the Uri constructor, not a relative or namespaced one.

This means in practice that you’ll be calling store->expand() for all relative URIs:

Triple t(store->expand(":bob"), Uri("a"), store->expand("profession:Builder"));

(Note the magic word “a” still gets special treatment as an honorary absolute URI.)

Meanwhile, a bare string will be treated as an RDF literal, not a URI.
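In other words, following the pattern of the example above (the profession:name property is invented for the sake of the illustration), it is the type of the third argument that decides what kind of node it becomes:

// Object is a URI, because a Uri is passed:
Triple job(store->expand(":bob"), Uri("a"), store->expand("profession:Builder"));

// Object is a literal, because a bare string is passed:
Triple name(store->expand(":bob"), store->expand("profession:name"), "Bob");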

Why?

Here’s how the original API evolved. (This bit will be of limited interest to most, but I’ll base a few short conclusions on it in a later post.)

An RDF triple consists of three nodes, each of which may be a URI (i.e. an “identity” node), a literal (a “data” node) or a blank node (an anonymous node with no URI of its own).

A useful RDF triple is a bit more limited. The subject must be a URI or blank node, and the predicate can only realistically be a URI.

Given these limitations, I arrived at the original API through a test-driven approach. I went through a number of common use cases and tried to determine, and write a unit test for, the simplest API that could satisfy them. Then I wrote the code to implement that, and fixed the situations in which it couldn’t work.

So I first came up with expressions like this Triple constructor:

Triple t(":bob", "a", "profession:Builder");

Here :bob and profession:Builder are relative URIs; bob is relative to the store’s local base URI and Builder is relative to a namespace prefix called “profession:”. When the triple goes into a store, the store will need to expand them (presumably, it knows what the prefixes expand to).

This constructor syntax resembles the way a statement of this kind is encoded in Turtle, and it’s fairly easy to read.

It quickly runs into trouble, though. Because the object part of a subject-predicate-object can be either a URI or a literal, profession:Builder is ambiguous; it could just be a string. So, I introduced a Uri class.[1]

Triple t(":bob", "a", Uri("profession:Builder"));

Now, the Uri object stores a URI string of some sort, but how do I know whether it’s a relative or an absolute URI? I need to make sure it’s an absolute URI already when it goes into the Uri constructor—and that means the store object must expand it, because the store is the thing that knows the URI prefixes.[2]

Triple t(":bob", "a", store->expand("profession:Builder"));

Now I can insist that the Uri class is only used for absolute URIs, not for relative ones. So the store knows which things it has to expand (strings), and which come ready-expanded (Uri objects).

Of course, if I want to construct a Triple whose subject is an absolute URI, I have another constructor to which I can pass a Uri instead of a string as the first argument:

Triple t(Uri("http://example.com/my/document#bob"), "a", store->expand("profession:Builder"));

This API got me through to v0.8, with a pile of unit tests and some serious use in a couple of applications, without complaint. Because it made for simple code and it clearly worked, I was happy enough with it.

What went wrong?

The logical problem is pretty obvious, but surprisingly hard to perceive when the tests pass and the code seems to work.

In the example

Triple t(":bob", "a", store->expand("profession:Builder"));

the first two arguments are just strings: they happen to contain relative URIs, but that seems OK because it’s clear from context that they aren’t allowed to be literals.

Unfortunately, just as in the Uri constructor example above, there is nothing to say they aren’t already absolute URIs. The store can’t really know whether it should expand them or not: it can only guess.

More immediately troublesome for the developer, the API is inconsistent.

Only the third argument is forced to be a Uri object. The other two can be strings that happen to get treated as URIs when they show up at the store. Of course, the user will forget about that distinction and end up passing a string as the third argument as well, which doesn’t work.
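A sketch of that trap (not code from Dataquay’s own tests):

// Fine under the 0.8 API: two plain strings and a Uri
Triple ok(":bob", "a", store->expand("profession:Builder"));

// The natural-looking next step, which does not do what it appears to:
// the third argument is not a Uri, so it is not treated as a URI at all
Triple broken(":bob", "a", "profession:Builder");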

This sort of thing is hard to remember, and puzzling when encountered for the first time independent of any documentation.

In limited unit tests, and in applications that mostly use the higher-level object mapping APIs, problems like these are not particularly apparent. Switch to a different usage pattern involving larger numbers of low-level queries, and then they become evident.


[1] Why invent a new Uri class, instead of using the existing QUrl?

Partly because it turns out that QUrl is surprisingly slow to construct. For a Uri all I need is a typed wrapper for an existing string. Early versions of the library did use QUrl, but profiling showed it to be a significant overhead. Also, I want to treat a URI, once expanded, as a simple opaque signifier rather than an object with a significant interface of its own.

So although Dataquay does use QUrl, it does so only for locations to be resolved and retrieved: network or filesystem locations. The lightweight Uri object is used for URIs within the graph.
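Roughly the shape of thing I mean, as a sketch rather than Dataquay’s actual class: a QString with a distinct type, so a URI can never be mistaken for a literal string, and construction costs little more than a string copy.

#include <QString>

// Illustrative only: a typed wrapper around an already-expanded URI string
class Uri
{
public:
    Uri() {}
    explicit Uri(const QString &uri) : m_uri(uri) {}

    QString toString() const { return m_uri; }
    bool operator==(const Uri &other) const { return m_uri == other.m_uri; }

private:
    QString m_uri;
};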

[2] Dataquay’s Node and Triple objects represent the content of a theoretical node or triple, not the identity of any actual instance of a node or triple within a store.

A Triple, once constructed, might be inserted into any store—it doesn’t know which store it will be associated with. (This differs from some other RDF libraries and is one reason the class is called Triple rather than Statement: it’s just three things, not a part of a real graph.) So the store must handle expansions, not the node or triple.

Dataquay

Dataquay is my C++ library for RDF datastore management using the Qt toolkit.

It’s a library for people who happen to be writing C++ applications using Qt and who are interested in managing data that fit well into a subject-predicate-object graph model (as in the Linked Data paradigm, for example).

It uses Qt classes and coding style throughout, and includes an object mapper for store and recall of Qt’s property-based introspectable objects.

The library grew out of an interest in exploring RDF as a representational model for data in a traditional document-based editing application that used Qt. In purpose, therefore, it has more in common with aspects of Core Data or Hibernate than with semantic data frameworks such as Soprano. That is:

High priority

  • Simple datastore API
  • Natural object model
  • Works well with data from local files in the human-compatible RDF/Turtle format
  • Can be built into an application (BSD licence, few external dependencies)

Lower priority

  • Quasi-semantic queries (SPARQL)
  • Organisation of common linked-data namespaces
  • Broad file format support
  • Backend support for relational databases
  • Networking
  • Scalability

It’s a good example of a library developed by one programmer for an immediate application and then generalised.

Version 0.9 of Dataquay is out now, and it’s creeping slowly up to a stable 1.0.

There have been some API changes between 0.8 and 0.9, which I’ll post about separately.

Spare us humans from XML

XML appeared in 1996, was refined during 1997, and was standardised in 1998.

I remember a lot of excitement about it at the time, from managers who imagined it would solve all their data portability problems. I was conscious of some of this enthusiasm before I really looked at the format.

When I did, I thought

ugh

But it wasn’t as bad as all that. The idea of XML was to standardise the lexical layer for a file format, helping to cut down on the programmer’s natural tendency to just bodge something up when saving and worry about loading later.

It worked, and it made and still makes a tolerable format for all kinds of things—within limits. Of course it’s horribly verbose and slow to parse, but hey, it compresses well. And you still don’t get reliable interchange unless you have a known storage structure on both sides, which is what a series of increasingly complex helper standards built on top of XML evolved to help with.

One thing XML never was, though, was nice for humans to read.

At the time this seemed OK, because it obviously wasn’t intended to be for humans. We humans would never be editing it. We’d only ever be looking at it through filters and lenses and programs that knew what it really meant. We’d never actually have to see the format.

Fifteen years later, here I am sitting looking at

<if>
    <condition>
        <isset property="tested.project.dir" />
    </condition>
    <then>
        <property name="tested.project.absolute.dir" location="${tested.project.dir}" />
        <xpath input="${tested.project.absolute.dir}/AndroidManifest.xml"
            expression="/manifest/@package" output="tested.manifest.package" />
        <if>
            <condition>
                <isset property="tested.manifest.package" />

Oof! Enough of that.  But that’s Android development: of course that’s for robots.

Windows Phone is for people, though. How about:

<phone:PhoneApplicationPage.ApplicationBar>
    <shell:ApplicationBar Opacity="0">
        <shell:ApplicationBarIconButton Text="previous" IsEnabled="False"
            IconUri="/Shared/Images/appbar.left.png" Click="PreviousButton_Click"/>
        <shell:ApplicationBarIconButton Text="page jump"
            IconUri="/Images/appbar.book.png" Click="JumpButton_Click"/>
        <shell:ApplicationBarIconButton Text="settings" IconUri=

Aargh.

These aren’t pathological examples that I’m having to grub around the internals of some graphical environment in order to find. That last one, for example, is copied verbatim from a beginners’ programming book for Windows Phone 7.

I’d far, far rather see developers use XML than go back to rolling completely unique file formats for every application. But surely by now there are enough widely-supported data representation languages—formats that simplify the presentation by standardising object relationships as well as lexical details, such as JSON or RDF/Turtle—to cover the majority of situations in which humans may end up having to edit something?