Code

Details of the Dataquay v0.9 API changes

Dataquay hasn’t seen a great deal of use yet. I’ve used it in a handful of personal projects that follow the same sort of model as the application it was first designed for, and that’s all.

But I’ve recently started to adapt it to a couple of programs whose RDF usage follows more traditional Linked Data usage patterns (Sonic Visualiser and Sonic Annotator), as a replacement for their fragile ad-hoc RDF C++ interfaces. And it became clear that in these contexts, the API wasn’t really working.

I can get away with changing the API now as Dataquay is still a lightly-used pre-1.0 library. But, for the benefit of the one or two other people out there who might be using it—what has changed, and why?

The main change

The rules for constructing Dataquay Uri, Node and Triple objects have been simplified, at the expense of making some common cases a little longer.

If you want to pass an RDF URI to a Dataquay function, you must now always use a Uri object, rather than a plain Qt string. And to create a Uri object you must have an absolute URI to pass to the Uri constructor, not a relative or namespaced one.

This means in practice that you’ll be calling store->expand() for all relative URIs:

Triple t(store->expand(":bob"), Uri("a"), store->expand("profession:Builder"));

(Note the magic word “a” still gets special treatment as an honorary absolute URI.)

Meanwhile, a bare string will be treated as an RDF literal, not a URI.

Why?

Here’s how the original API evolved. (This bit will be of limited interest to most, but I’ll base a few short conclusions on it in a later post.)

An RDF triple is a set of three nodes, each of which may be a URI (i.e. an “identity” node), a literal (a “data” node) or a blank node (an unnamed URI).

A useful RDF triple is a bit more limited. The subject must be a URI or blank node, and the predicate can only realistically be a URI.

Given these limitations, I arrived at the original API through a test-driven approach. I went through a number of common use cases and tried to determine, and write a unit test for, the simplest API that could satisfy them. Then I wrote the code to implement that, and fixed the situations in which it couldn’t work.

So I first came up with expressions like this Triple constructor:

Triple t(":bob", "a", "profession:Builder");

Here :bob and profession:Builder are relative URIs; bob is relative to the store’s local base URI and Builder is relative to a namespace prefix called “profession:”. When the triple goes into a store, the store will need to expand them (presumably, it knows what the prefixes expand to).

This constructor syntax resembles the way a statement of this kind is encoded in Turtle, and it’s fairly easy to read.

It quickly runs into trouble, though. Because the object part of a subject-predicate-object can be either a URI or a literal, profession:Builder is ambiguous; it could just be a string. So, I introduced a Uri class.[1]

Triple t(":bob", "a", Uri("profession:Builder"));

Now, the Uri object stores a URI string of some sort, but how do I know whether it’s a relative or an absolute URI? I need to make sure it’s an absolute URI already when it goes into the Uri constructor—and that means the store object must expand it, because the store is the thing that knows the URI prefixes.[2]

Triple t(":bob", "a", store->expand("profession:Builder"));

Now I can insist that the Uri class is only used for absolute URIs, not for relative ones. So the store knows which things it has to expand (strings), and which come ready-expanded (Uri objects).

Of course, if I want to construct a Triple whose subject is an absolute URI, I have another constructor to which I can pass a Uri instead of a string as the first argument:

Triple t(Uri("http://example.com/my/document#bob"), "a", store->expand("profession:Builder"));

This API got me through to v0.8, with a pile of unit tests and some serious use in a couple of applications, without complaint. Because it made for simple code and it clearly worked, I was happy enough with it.

What went wrong?

The logical problem is pretty obvious, but surprisingly hard to perceive when the tests pass and the code seems to work.

In the example

Triple t(":bob", "a", store->expand("profession:Builder"));

the first two arguments are just strings: they happen to contain relative URIs, but that seems OK because it’s clear from context that they aren’t allowed to be literals.

Unfortunately, just as in the Uri constructor example above, it isn’t clear that they can’t be absolute URIs. The store can’t really know whether it should expand them or not: it can only guess.

More immediately troublesome for the developer, the API is inconsistent.

Only the third argument is forced to be a Uri object. The other two can be strings, that happen to get treated as URIs when they show up at the store. Of course, the user will forget about that distinction and end up passing a string as the third argument as well, which doesn’t work.

This sort of thing is hard to remember, and puzzling when encountered for the first time independent of any documentation.

In limited unit tests, and in applications that mostly use the higher-level object mapping APIs, problems like these are not particularly apparent. Switch to a different usage pattern involving larger numbers of low-level queries, and then they become evident.


[1] Why invent a new Uri class, instead of using the existing QUrl?

Partly because it turns out that QUrl is surprisingly slow to construct. For a Uri all I need is a typed wrapper for an existing string. Early versions of the library did use QUrl, but profiling showed it to be a significant overhead. Also, I want to treat a URI, once expanded, as a simple opaque signifier rather than an object with a significant interface of its own.

So although Dataquay does use QUrl, it does so only for locations to be resolved and retrieved —network or filesystem locations. The lightweight Uri object is used for URIs within the graph.

[2] Dataquay’s Node and Triple objects represent the content of a theoretical node or triple, not the identity of any actual instance of a node or triple within a store.

A Triple, once constructed, might be inserted into any store—it doesn’t know which store it will be associated with. (This differs from some other RDF libraries and is one reason the class is called Triple rather than Statement: it’s just three things, not a part of a real graph.) So the store must handle expansions, not the node or triple.

2 thoughts on “Details of the Dataquay v0.9 API changes

Comments are closed.