Undergraduate programming languages

I read two quite different articles about programming in academia today.

I don’t know Yossi Kreinin, and when his piece Why bad scientific code beats code following “best practices” appeared on the Hacker News front page, I guessed that I probably wouldn’t agree with it. I’m a programmer working in academia who has spent some time trying to find ways to improve the code that researchers write, and I don’t want to be told I’m wasting my time.

But on the whole I agreed with him. A totally naive programmer is going to produce messy, organic, but basically linear code that will usually be easier to understand and work with than the code of someone who has learned a bit of Java and wants to put everything in a factory class. The part I don’t agree with of course, and that I suspect he put in just for the sake of the headline, is that these really are best practices.

I think that one of the hidden goals of a project like Software Carpentry is to teach that the reason software developers appear to be “too clever”, and somehow unreachable for the scientist outside the software field, is that they are being too clever. Programming gets over-complicated; you can learn to write good code (that is, readable code) more easily than you can learn to write bad code the professional programmers’ way.

I also identified with Yossi’s line that “I claim to have repented, mostly”. Most of the code I’ve written in my career is not very good, by this standard. It’s a long path to enlightenment.

The other article was by my friend Christophe Rhodes: What is a good language to teach at undergraduate level for Computing degrees?

A computing degree is an odd thing. Computer science is the theory of how computers and programs work. Computing, as a subject, is computer science plus some stuff about how we should actually program them in order to get a job done. The two are very different.

Christophe breaks down objectives of a computing degree: Think, experiment, job, career, study, society. I’m not very familiar with the study objective, which refers to postgraduate study in a computing department (something I never did). Think refers to teaching students how to think and solve problems computationally. I think he may be missing a step, and it may be useful to separate thinking about how the computer works (“understand”?) from thinking about how the programmer can work.

There is also, of course, the question of how to avoid making your programmer feel too clever.

As an undergraduate in 1990-94, I was taught, in this order, Prolog, C, ML, 68k assembly language, and C++. I also learned some Lisp, though I can’t remember being formally taught it. Prolog, C, and the assembly language were all good bases for what I referred to as the understand objective, getting something out of the history of computing and learning about how computers work. ML was a wonderful introduction to thinking, and it’s no coincidence that I’ve recently been programming in an ML variant as an engaging alternative to work.

The hard one to evaluate is C++. It was a very poor teaching language for object-oriented programming, which is a pity because at the time we learned it, I didn’t get object-oriented programming at all. And we learned nothing about any other kind of programming from it, having already been taught three different high-level languages. So it contributed very poorly to the think objective and not at all to the understand one. It’s an awful language to learn to write clear, reliable code in, therefore bad from the society perspective, and you can (as I have) spend 20 years learning to write it, which is obviously ridiculous, hence you would think also bad from the job perspective.

But it turns out that C++ has been the most valuable job language there could be, because (a) it is resiliently portable: it turns out that write-once-run-anywhere with a virtual machine comes and goes in waves, depending on the whim of the top operating system provider of the time, but compiling to machine code seems to be eternal; (b) it is just about able to ape a number of programming paradigms, so you can get away with adopting a style without adopting another language; and (c) its complexity means that it looks good on your CV. I honestly wish, as a 20+-year C++ programmer, that we could make it so that nobody ever had to learn C++ again, but even now I don’t think that is the case.

The one goal completely unaddressed by my own degree, and almost every language I’ve learned since, is experiment. I think there are two essentials for this: a responsive interactive interpreter, and visual responsiveness, like a graphical scene environment or plotting tools. Most of the things you want to do here are probably growing up in Javascript and HTML5.

I don’t really know anything about teaching, about the actual on-the-ground business of making people learn stuff. So you shouldn’t listen to me on this next bit, I’m just daydreaming. But if I were planning a 3-year computing degree course in my head now, I think I would aim to teach, in this order:

  • Python, for the basics of procedural computing. It’s the cleanest language for doing satisfying simple loops and input/output transformations, and is a sound general-purpose language.
  • Clojure. I’ve decided now that I don’t so much care for Lisp syntax day-to-day, but a Lisp gives you so much to talk about and investigate without getting lost in the specific requirements of the language. And a Lisp on the JVM gives you more depth: advanced students could learn quite a lot about Java without ever actually being taught Java.
  • Some assembly language, probably ARM and preferably with a nice visible bit of circuit board on hand.
  • Javascript, but not aiming to teach the language so much as using a specific interactive framework and with a specific small game-development project to complete. I would probably also use Typescript rather than untyped Javascript.
  • Haskell. As an ML guy it was always my enemy, but Haskell is the functional language that has endured. It’s a follow-on course from Lisp.
  • C++. Because, as far as I can see at the moment, you pretty much have to. But please, give them modern-style pointer ownership and RAII (i.e. avoiding explicit heap allocation).
  • Python again, in a closing course that taught people how best to do things in an actual working environment. Testing, not being too much of a smartarse, etc.

But the odd thing about that set of languages is that, just as with my own degree, you never get a very good dedicated object-oriented language. Maybe Objective-C could replace C++; it’s probably a clearer pedagogical object language, but C++ is everywhere while Objective-C is effectively platform-specific, even if it is a very popular platform.

And be sure to teach them version control.

Oh, and don’t forget to add a double-entry book-keeping course. It’s probably more useful than the programming stuff.

 

Small conclusions about APIs and testing

In my previous post I explained a small but significant API change for v0.9 of the Dataquay library.

Although there was nothing very deep about this change or its causes, I found it interesting partly because I had used a partly test-driven process to evolve the original API and I felt there may be a connection between the process and any resulting problems. Here are a few thoughts prompted by this change.

Passing the tests is not enough

Test-driven development is a satisfying and welcome prop. It allows you to reframe difficult questions of algorithm design in terms of easier questions about what an algorithm should produce.

But producing the right results in every test case you can think of is not enough. It’s possible to exercise almost the whole of your implementation in terms of static coverage, yet still have the wrong API.

In other words, it may be just as easy to overfit the API to the test cases as it is to overfit the test cases to the implementation.

Unit testing may be easier than API design

So, designing a good API is harder than writing tests for it. But to rephrase that more encouragingly: writing tests is easier than designing the API.

If, like me, you’re used to thinking of unit testing as requiring more effort than “just bunging together an API”, this should be a worthwhile corrective in both directions.

API design is harder than you think, but unit testing is easier. Having unit tests doesn’t make it any harder to change the API, either: maintaining tests during redesign is seldom difficult, and having tests helps to ensure the logic doesn’t get broken.

Types are not just annoying artifacts of the programming language

An unfortunate consequence of having worked with data representation systems like RDF mostly in the context of Web backends and scripting languages is that it leads to a tendency to treat everything as “just a string”.

This is fine if your string has enough syntax to be able to distinguish types properly by parsing it—for example, if you represent RDF using Turtle and query it using SPARQL.

But if you break down your data model into individual node components while continuing to represent those as untyped strings, you’re going to be in trouble. You can’t get away without understanding, and somewhere making explicit, the underlying type model.

Predictability contributes to simplicity

A simpler API is not necessarily one that leads to fewer or shorter lines of code. It’s one that leads to less confusion and more certainty, and carrying around type information helps, just as precondition testing and fail-fast principles can.

It’s probably still wrong

I’ve effectively found and fixed a bug, one that happened to be in the API rather than the implementation. But there are probably still many remaining. I need a broader population of software using the library before I can be really confident that the API works.

Of course it’s not unusual to see significant API holes in 1.0 releases of a library, and to get them tightened up for 2.0. It’s not the end of the world. But it ought to be easier and cheaper to fix these things earlier rather than later.

Now, I wonder what else is wrong…

Details of the Dataquay v0.9 API changes

Dataquay hasn’t seen a great deal of use yet. I’ve used it in a handful of personal projects that follow the same sort of model as the application it was first designed for, and that’s all.

But I’ve recently started to adapt it to a couple of programs whose RDF usage follows more traditional Linked Data usage patterns (Sonic Visualiser and Sonic Annotator), as a replacement for their fragile ad-hoc RDF C++ interfaces. And it became clear that in these contexts, the API wasn’t really working.

I can get away with changing the API now as Dataquay is still a lightly-used pre-1.0 library. But, for the benefit of the one or two other people out there who might be using it—what has changed, and why?

The main change

The rules for constructing Dataquay Uri, Node and Triple objects have been simplified, at the expense of making some common cases a little longer.

If you want to pass an RDF URI to a Dataquay function, you must now always use a Uri object, rather than a plain Qt string. And to create a Uri object you must have an absolute URI to pass to the Uri constructor, not a relative or namespaced one.

This means in practice that you’ll be calling store->expand() for all relative URIs:

Triple t(store->expand(":bob"), Uri("a"), store->expand("profession:Builder"));

(Note the magic word “a” still gets special treatment as an honorary absolute URI.)

Meanwhile, a bare string will be treated as an RDF literal, not a URI.

Why?

Here’s how the original API evolved. (This bit will be of limited interest to most, but I’ll base a few short conclusions on it in a later post.)

An RDF triple is a set of three nodes, each of which may be a URI (i.e. an “identity” node), a literal (a “data” node) or a blank node (an unnamed URI).

A useful RDF triple is a bit more limited. The subject must be a URI or blank node, and the predicate can only realistically be a URI.

Given these limitations, I arrived at the original API through a test-driven approach. I went through a number of common use cases and tried to determine, and write a unit test for, the simplest API that could satisfy them. Then I wrote the code to implement that, and fixed the situations in which it couldn’t work.

So I first came up with expressions like this Triple constructor:

Triple t(":bob", "a", "profession:Builder");

Here :bob and profession:Builder are relative URIs; bob is relative to the store’s local base URI and Builder is relative to a namespace prefix called “profession:”. When the triple goes into a store, the store will need to expand them (presumably, it knows what the prefixes expand to).

This constructor syntax resembles the way a statement of this kind is encoded in Turtle, and it’s fairly easy to read.

It quickly runs into trouble, though. Because the object part of a subject-predicate-object can be either a URI or a literal, profession:Builder is ambiguous; it could just be a string. So, I introduced a Uri class.[1]

Triple t(":bob", "a", Uri("profession:Builder"));

Now, the Uri object stores a URI string of some sort, but how do I know whether it’s a relative or an absolute URI? I need to make sure it’s an absolute URI already when it goes into the Uri constructor—and that means the store object must expand it, because the store is the thing that knows the URI prefixes.[2]

Triple t(":bob", "a", store->expand("profession:Builder"));

Now I can insist that the Uri class is only used for absolute URIs, not for relative ones. So the store knows which things it has to expand (strings), and which come ready-expanded (Uri objects).

Of course, if I want to construct a Triple whose subject is an absolute URI, I have another constructor to which I can pass a Uri instead of a string as the first argument:

Triple t(Uri("http://example.com/my/document#bob"), "a", store->expand("profession:Builder"));

This API got me through to v0.8, with a pile of unit tests and some serious use in a couple of applications, without complaint. Because it made for simple code and it clearly worked, I was happy enough with it.

What went wrong?

The logical problem is pretty obvious, but surprisingly hard to perceive when the tests pass and the code seems to work.

In the example

Triple t(":bob", "a", store->expand("profession:Builder"));

the first two arguments are just strings: they happen to contain relative URIs, but that seems OK because it’s clear from context that they aren’t allowed to be literals.

Unfortunately, just as in the Uri constructor example above, it isn’t clear that they can’t be absolute URIs. The store can’t really know whether it should expand them or not: it can only guess.

More immediately troublesome for the developer, the API is inconsistent.

Only the third argument is forced to be a Uri object. The other two can be strings, that happen to get treated as URIs when they show up at the store. Of course, the user will forget about that distinction and end up passing a string as the third argument as well, which doesn’t work.

This sort of thing is hard to remember, and puzzling when encountered for the first time independent of any documentation.

In limited unit tests, and in applications that mostly use the higher-level object mapping APIs, problems like these are not particularly apparent. Switch to a different usage pattern involving larger numbers of low-level queries, and then they become evident.


[1] Why invent a new Uri class, instead of using the existing QUrl?

Partly because it turns out that QUrl is surprisingly slow to construct. For a Uri all I need is a typed wrapper for an existing string. Early versions of the library did use QUrl, but profiling showed it to be a significant overhead. Also, I want to treat a URI, once expanded, as a simple opaque signifier rather than an object with a significant interface of its own.

So although Dataquay does use QUrl, it does so only for locations to be resolved and retrieved —network or filesystem locations. The lightweight Uri object is used for URIs within the graph.

[2] Dataquay’s Node and Triple objects represent the content of a theoretical node or triple, not the identity of any actual instance of a node or triple within a store.

A Triple, once constructed, might be inserted into any store—it doesn’t know which store it will be associated with. (This differs from some other RDF libraries and is one reason the class is called Triple rather than Statement: it’s just three things, not a part of a real graph.) So the store must handle expansions, not the node or triple.

Dataquay

Dataquay is my C++ library for RDF datastore management using the Qt toolkit.

It’s a library for people who happen to be writing C++ applications using Qt and who are interested in managing data that fit well into a subject-predicate-object graph model (as in the Linked Data paradigm, for example).

It uses Qt classes and coding style throughout, and includes an object mapper for store and recall of Qt’s property-based introspectable objects.

The library started out with an interest in exploring RDF as a representational model for data in a traditional document-based editing application that used Qt. In purpose therefore it has more in common with aspects of Core Data or Hibernate than with semantic data frameworks such as Soprano. That is:

High priority

  • Simple datastore API
  • Natural object model
  • Works well with data from local files in the human-compatible RDF/Turtle format
  • Can build in to an application (BSD licence, few external dependencies)

Lower priority

  • Quasi-semantic queries (SPARQL)
  • Organisation of common linked-data namespaces
  • Broad file format support
  • Backend support for relational databases
  • Networking
  • Scalability

It’s a good example of a library developed for an immediate application by one programmer and generalised.

Version 0.9 of Dataquay is out now, and it’s creeping slowly up to a stable 1.0.

There have been some API changes between 0.8 and 0.9, which I’ll post about separately.

Wrapping a C++ library with JNI, part 4

In this series…

  • Introduction, outlining the general steps from starting with a C++ library to being able to build and run simple tests on some JNI wrappers;
  • Part 1, in which I design some simple Java classes and generate the stub wrapper code;
  • Part 2, in which I add just enough of the implementation to be able to do a test build;
  • Part 3, discussing object lifecycles in C++ and Java;
  • Part 4 (this post), the final episode covering a few remaining points of interest.

A More Complex Method

The most complex function in our API is the process method in the Plugin class. In C++, this is

typedef std::vector<Feature> FeatureList;
typedef std::map<int, FeatureList> FeatureSet;

virtual FeatureSet process(const float *const *inputBuffers,
                           RealTime timestamp) = 0;

That is, the process method of a Plugin object takes a two-dimensional array of floats and a RealTime object as arguments, and returns a map from an integer to a FeatureList, which is a sequence of Feature objects stored in a vector. That’s fairly complicated.

Java lacks typedef or any very satisfactory alternative. So, we render this as

public native Map<Integer, ArrayList<Feature>>
    process(float[][] inputBuffers, RealTime timestamp);

where Feature and RealTime are classes we must provide separately. (It’s good form to declare the return type using an interface such as Map in cases where the caller doesn’t need to care which specific container implementation is used. In this case our concrete return type should probably be a TreeMap.)

I’m not going to explain every detail of the implementation of this in the JNI wrapper, but I want to illustrate a couple of aspects:

Constructing Java objects from JNI code

In this particular case, the objects we want to return take rather a lot of work to construct. However, we can illustrate a simple example. At the innermost level, all of our returned objects have type Feature, a Java class we define which has a constructor that needs no arguments.

To construct one of these, first we need to look up its class:

jclass featClass = env->FindClass("org/vamp_plugins/Feature");

Then we find the method in the class that corresponds to the constructor. The GetMethodID function is used for looking up all non-static methods, with the method name supplied as the first string argument. For a constructor, the method name is <init>. The final argument gives the signature of the method to look up; in this case ()V means a method taking no arguments with a void return type.

jmethodID ctor = env->GetMethodID(featClass, "<init>", "()V");

Then, to construct the object we simply call the constructor:

jobject feature = env->NewObject(featClass, ctor);

Handling Generics through JNI

This is dead easy. Generic types are there only for the purpose of compiler type-checking: they don’t exist in the virtual machine or in the .class file.

In terms of Java objects, including from the perspective of any JNI code, our ArrayList<Feature> is simply ArrayList.

So, to construct the TreeMap<Integer, ArrayList<Feature>> we intend to return from our function, we only need to do this:

jclass treeMapClass = env->FindClass("java/util/TreeMap");
jmethodID treeMapCtor = env->GetMethodID(treeMapClass, "<init>", "()V");
jobject map = env->NewObject(treeMapClass, treeMapCtor);

Of course we then need to make sure the objects we put in it are of the right type, as the Java compiler is no longer there to help us with type-checking.

Similarly, if a generic container gets passed to a native function, we can (and must) just ignore the type specialisation when unpacking it.

Getting Data from Multi-Dimensional Arrays

My process function takes a two-dimensional array of floats as one of its arguments. (This actually represents multi-channel audio sample data.)

Multi-dimensional arrays in Java are easier to deal with reliably than in C++. An array is an object, so a two-dimensional array is an array of array objects. It appears in the JNI implementation as a jobjectArray:

jobject
Java_org_vamp_1plugins_Plugin_process
    (JNIEnv *env, jobject obj,
     jobjectArray inputBuffers, jobject timestamp);

JNI provides simple accessor methods for getting at array elements: to pull out an object from an array of objects, we use GetObjectArrayElement. That will enable us to get hold of the second dimension of our arrays.

Then, to access a sequence of non-object type elements such as our float values all at once, we need to use a symmetrical pair of calls: one to lock the elements in place so the garbage collector can’t get at them, and the other to release them again. To wit, GetFloatArrayElements and ReleaseFloatArrayElements.

Putting these together to retrieve our two-dimensional float array in C++, we have as the body of our process function:

int channels = env->GetArrayLength(data);
float **input = new float *[channels];

for (int c = 0; c < channels; ++c) {
    jfloatArray cdata =
        (jfloatArray)env->GetObjectArrayElement(data, c);
    input[c] = env->GetFloatArrayElements(cdata, 0);
}

Then we do something with the C array that is now stored in input, hang on to the output for a moment, and tidy up:

for (int c = 0; c < channels; ++c) {
    jfloatArray cdata =
        (jfloatArray)env->GetObjectArrayElement(data, c);
    env->ReleaseFloatArrayElements(cdata, input[c], 0);
}

delete[] input;

and we’re done. It remains only to construct and return our rather complex return value based on whatever we calculated with the input array earlier.

That’s all

That’s all for this series—thanks for reading, and I hope it’s been useful.

Wrapping a C++ library with JNI, part 3

In this series…

  • Introduction, outlining the general steps from starting with a C++ library to being able to build and run simple tests on some JNI wrappers;
  • Part 1, in which I design some simple Java classes and generate the stub wrapper code;
  • Part 2, in which I add just enough of the implementation to be able to do a test build;
  • Part 3 (this post), discussing object lifecycles in C++ and Java;
  • Part 4, the final episode covering a few remaining points of interest.

My previous post on this topic ended with a very simple test program built and run successfully, wrapping a small subset of the C++ library I was referring to.

That’s all I want to go through step-by-step, but there are still a couple more things to explain. The first one is…

Disposing of C++ objects

Java objects are garbage-collected: you don’t have to delete them when you’ve finished with them.

C++ objects are not. Unless they’re locally scoped objects allocated on the stack (i.e. created without a call to new), you have to delete them explicitly when you’ve finished with them, or else you’re creating a resource leak.

The wrapper code I’ve been describing uses Java objects that “own” their corresponding C++ objects. My Plugin object for example contains an opaque nativeObject handle which, on the C++ side, gets interpreted as a pointer to the C++ object that provides the plugin implementation. This object belongs to the Java Plugin wrapper, and it should have a corresponding lifetime which we need to manage. Nothing else is going to delete it if we forget to.

This means we need the Java object to delete its C++ object when it is itself deleted. But a Java object isn’t deleted explicitly, and it has no C++-style destructor function we could make this happen from.

We have two options:

  1. Add a method to the Java class that deletes the C++ object, and insist that users of our library remember to call it;
  2. Delete the C++ object from the Java class’s finalize method, which is called by the JVM when the Java object is garbage collected.

The second option, using finalize, is attractive because it reduces the burden on our users. Deleting the object would be automatic. If we take the first option, any method we add to delete our object will be one that isn’t required for most Java classes, and that also doesn’t exist in the same form in the C++ library we’re wrapping—so developers coming from either Java or C++ will quite likely overlook it.

But there are several problems with using finalize for this.

One problem is that, because the JVM only controls memory and resources within the Java heap, it can’t take into account any resources allocated in the native layer when determining which Java objects it needs to garbage-collect: our Java objects will seem smaller than they effectively are. So it might not finalize our objects quickly enough to prevent memory pressure.

A second problem is that we can’t guarantee which thread the finalize method will be called from. It’s quite important that our C++ object is deleted from the same thread that allocated it.

Finally, it’s possible that the C++ library we are wrapping might expect objects to be deleted in some particular order relative to one another, and we can’t enforce that if we leave the garbage collector to do it. When in C++, we should do as C++ programmers do.

And that means we need to get the user of our library to tell us when to delete objects: we must at least take our option 1. That is, we add a method called dispose to the Java class: when it’s called, it deletes the underlying native object. But, of course, our user—the programmer writing to our library—must remember to call it.

(Why “dispose”? Well, it’s a good name, and it’s the name used by some other libraries that rely on native code, such as the SWT widget toolkit.)

In our Plugin class, the dispose call would be simply

    public native void dispose();

and the implementation in Plugin.cpp, using the native handle helper method referred to in my earlier post, would be along the lines of

void
Java_org_vamp_1plugins_Plugin_dispose(JNIEnv *env, jobject obj)
{
    Plugin *p = getHandle(env, obj);
    setHandle(env, obj, 0);
    delete p;
}

In principle we could also make the finalize method call dispose in case the user forgets. But that’s probably unwise, as the library’s behaviour could become quite unpredictable in subtle ways if the pattern and order of deallocations depended on whether the user had remembered to dispose an object or had left it to the garbage collector.

Another possibility might be to test within finalize whether the object has been disposed, and print a warning to System.err if it hadn’t.

Coming next

In my next (and final) post on this subject, I’ll give an illustration of a function with rather more complex argument and return types than the ones we’ve seen so far.