Rubber Band Library v1.8.2

I have finally managed to get together all the bits that go into a release of the Rubber Band library, and so have just released version 1.8.2.

The Rubber Band library is a software library for time-stretching and pitch-shifting of audio, particularly music audio. That means that it takes a recording of music and adjusts it so that it plays at a different speed or at a different pitch, and if desired, it can do that by changing the speed and pitch “live” as the music plays. This is impossible to do perfectly: essentially you are asking software to recreate what the music would have sounded like if the same musicians had played it faster, slower, or in a different key, and there just isn’t enough information in a recording to do that. It changes the sound and is absolutely not a reversible transformation. But Rubber Band does a pretty nice job. For anyone interested, I wrote a page (here) with a technical summary of how it does it.
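For the curious, here is a minimal sketch of what driving the library looks like from calling code, using its real-time mode so that the speed and pitch ratios can be changed while audio flows through. This is an illustration written for this post, not an excerpt from the library's documentation, and error handling and proper buffer management are left out.

#include <rubberband/RubberBandStretcher.h>

#include <cmath>
#include <vector>

using RubberBand::RubberBandStretcher;
using std::vector;

int main()
{
    int sampleRate = 44100, channels = 2;

    RubberBandStretcher stretcher(sampleRate, channels,
                                  RubberBandStretcher::OptionProcessRealTime);

    stretcher.setTimeRatio(1.5);                        // stretch to 150% duration
    stretcher.setPitchScale(std::pow(2.0, 3.0 / 12.0)); // shift up three semitones

    // Feed a block of (silent) stereo audio in...
    vector<float> left(1024, 0.f), right(1024, 0.f);
    const float *in[] = { left.data(), right.data() };
    stretcher.process(in, 1024, false);                 // false: more input to come

    // ...and retrieve whatever output is ready so far
    int ready = stretcher.available();
    if (ready > 0) {
        vector<float> outL(ready), outR(ready);
        float *out[] = { outL.data(), outR.data() };
        stretcher.retrieve(out, ready);
    }

    return 0;
}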

I originally wrote this library between 2005 and 2007, with a v1.0 release at the end of 2007. My aim was to provide a useful tool for open source GPL-licensed audio applications on Linux, like Ardour or Rosegarden, with a commercial license as an afterthought. As so often happens, I seriously underestimated the work involved in getting the library from “working” (a few weeks of evening and weekend coding) to ready to use in production applications (two years).

It has now been almost six years since the last Rubber Band release, and since this one is just a bugfix release, we can say the library is pretty much finished. I would love to have the time and mental capacity for a version 2: there are many many things I would now do differently. (Sadly, the first thing is that I wouldn’t rely on my own ears for basic testing any more—in the intervening decade my hearing has deteriorated a lot and it amazes me to think that I used to accept it as somehow authoritative.)

In spite of all the things I would change, I think this latest release of version 1 is pretty good. It’s not the state-of-the-art, but it is very effective, and is in use right now in professional audio applications across the globe. I hope it can be useful to you somehow.

 

Repoint: A manager for checkouts of third-party source code dependencies

I’ve just tagged v1.0 of Repoint, a tool for managing library source code in a development project. Conceptually it sits somewhere between Mercurial/Git submodules and a package manager like npm. It is intended for use with languages or environments that don’t have a favoured package manager, or in situations where the dependent libraries themselves aren’t aware that they are being package-managed. Essentially, situations where you want, or need, to be a bit hands-off from any actual package manager. I use it for projects in C++ and SML among other things.

Like npm, Bundler, Composer etc., Repoint refers to a project spec file, provided by you, that lists the libraries you want to bring in to your project directory (and which are brought in to the project directory, not installed to a central location). Like them, it creates a lock file to record the versions that were actually installed, which you can commit for repeatable builds. But unlike npm et al, all Repoint actually does is clone from the libraries’ upstream repository URLs into a subdirectory of the project directory, just as happens with submodules, and then report accurately on their status compared with their upstream repositories later.

The expected deployment of Repoint consists of copying the Repoint files into the project directory, committing them along with everything else, and running Repoint from there, in the manner of a configure script — so that developers generally don’t have to install it either. It’s portable and it works the same on Linux, macOS, or Windows. Things are not always quite that simple, but most of the time they’re close.

At its simplest, Repoint just checks stuff out from Git or whatever for you, which doesn’t look very exciting. An example on Windows:

repoint

Simple though Repoint’s basic usage is, it can run things pretty rigorously across its three supported version-control systems (git, hg, svn), it gets a lot of annoying corner cases right, and it is solid, reliable, and well-tested across platforms. The README has more documentation, including of some more advanced features.

Is this of any use to me?

Repoint might be relevant to your project if all of the following apply:

  • You are developing with a programming language or environment that has no obvious single answer to the “what package manager should I use?” question; and
  • Your code project depends on one or more external libraries that are published in source form through public version-control URLs; and
  • You can’t assume that a person compiling your code has those libraries installed already; and
  • You don’t want to copy the libraries into your own version-control repo to form a Giant Monorepo; and
  • Most of your dependent libraries do not similarly depend on other libraries (Repoint doesn’t support recursive dependencies at all).

Beyond mere relevance, Repoint might be actively useful to your project if any of the following also apply:

  • The libraries you’re using are published through a mixture of version-control systems, e.g. some use Git but others Mercurial or Subversion; or
  • The libraries you’re using and, possibly, your own project might change from one version-control system to another at some point in the future.

See the README for more caveats and general documentation.

Example

The biggest current example of a project using Repoint is Sonic Visualiser. If you check out its code from Github or from the SoundSoftware code site and run its configure script, it will call out to repoint install to get the necessary dependencies. (On platforms that don’t use the configure script, you have to run Repoint yourself.)

Note that if you download a Sonic Visualiser source code tarball, there is no reference to Repoint in it and the Repoint script is never run — Repoint is very much an active-developer tool, and it includes an archive function that bundles up all the dependent libraries into a tarball so that people building or deploying the end result aren’t burdened with any additional utilities to use.

I also use Repoint in various smaller projects. If you’re browsing around looking at them, note that it wasn’t originally called Repoint — its working title in earlier versions was vext and I haven’t quite finished switching the repos over. Those earlier versions work fine, of course; they just use different words.

 

A film camera

I take a lot of photos and I share some of them online via the antique medium of Flickr. Not many people look at them, which I don’t mind, because I imagine my audience to be (a) family and (b) myself, later. Photos I take with people in them are usually visible only to my friends and family. I’m a person who takes photographs, not a photographer.

But I do take some joy in the practice of photography. That’s partly because I can: at my level there is very little to it, because somebody else has done all the hard work. There is massive, long-term, highly technically sophisticated labour behind every functional detail of image capture and reproduction, which all culminates, for routine takers-of-photos like me, in pressing a button or tapping the screen and deciding whether you like the resulting image or not. It’s a ritual that has delivered a spurious feeling of creativity to people for decades, prefiguring the internet age.

There are four categories of potential joy in a photo, and they go in this order:

  1. Looking at whatever it is you’re thinking of taking a photo of
  2. Solving technical problems, or just fiddling with the camera
  3. Enjoying the picture itself
  4. Finding the pic again later and reminiscing

Obviously, photos of your friends in the pub can skip categories 1 (except in a social sense) and 2. Very deliberate landscape photos might have a lot of categories 1, 2, and 3 but not a great deal of category 4. Please understand that I am vaguely blathering about this category thing because it seemed to make sense while I was typing this, not because I think it’s any kind of real system.

I started out taking photos on film, then moved to a digital camera in 2003. By then a digital camera already gave you more pleasure in the likely quality and serendipity of your pictures, good for categories 3 and 4 above. I did keep a film SLR — a Zenit EM, the cheapest second-hand SLR available when I bought it in 1991 — but it’s very clumsy to use, and the category-2 joy you would imagine getting from it never really materialised.

In search of that sort of joy, I recently bought a slightly fancier second-hand film camera from someone on eBay. I wanted something still mostly manual, but more like the kind of thing I never had access to when young. I decided to buy a Minolta, and I’m not ashamed to say that was mostly because I like the old Minolta logo, before they introduced the more familiar Saul Bass-designed all-caps logotype in 1981. The older logo is verging on Comic Sans in its friendliness, and gives the camera a sweet face:

Minolta XG-9

This manual-focus, manual-aperture, automatic-exposure Minolta XG-9 dates from about 1980. It was the cheaper of Minolta’s two SLR ranges at the time; the body probably cost slightly more than a 48K ZX Spectrum home computer. It would have been a nice and very practical camera. It also embodies a mind-boggling amount of mechanical complexity compared with modern equipment. To the right is a schematic of the winder mechanism, one of dozens of such illustrations found in the service manual.

Winder schematic for XG-9

It was sold to me with a lens at least a decade older than the camera, a splendid-looking chunk of metal with an impressive (and apparently rather 1960s) wide front glass element.

Anyway, I bought this for my category 2, the fiddling. But so far what I’ve appreciated most is the thing I gave as category 1: just looking at the real scene more closely. Without a screen to review photos on, you have to assume that your photo will fail and you will never get to see this again. If an image should appear again, days later when you get the film processed, it’s a fresh delight.

The downside is that the economics of successful photos still apply. That is, only one in ten shots is any good. With a smartphone or digital camera you can take a hundred photos and have ten you really like. With film you buy a 36-photo roll, get three or four decent photos, but have to visit the print shop twice and spend at least £15 buying and developing the film. (Though I was surprised to find that you can still get film processed at Snappy Snaps.)

And how are the photos? Well, it’s a bit like listening to vinyl records. It’s nice for things that benefit from a bit of roughness and vigour, like this kind of thing:

Westbourne Terrace

(That one wanted a lens hood to prevent the flares in the middle and right, but I didn’t have one at the time.)

Or for snaps of people:

[untitled photo]

I like both of those a lot, but I’ve yet to get any really successful landscape or “still-life” pics from it and I suspect I never will, now that I’m used to a cleaner, higher resolution digital image.

Will I be using it much? Am I going to carry it around everywhere, but take far fewer and more selective photos than I otherwise might? Probably not, but it might not be up to me anyway. These are fairly solid cameras, but this one is nearly 40 years old and has a few electronic bits as well as sensitive mechanical parts. They do fail in various ways and I don’t entirely trust that it’s going to be still working the next time I want to use it. That primitive but sturdy Zenit will probably have the last laugh.

 

What does a convolutional neural net actually do when you run it?

Convolutional neural networks (or convnets or CNNs) are a staple of “deep learning”. There are many tutorials available that describe what they do, either mathematically or via quasi-mystical appeals to intuition, and introduce how to train and use them, often with image classification examples.

This post has a narrower focus. As a programmer, I am happy processing floating-point data in languages like C++, but I’m more likely to be building applications with pre-trained models than training new ones. Say I have a pre-trained convnet image classifier, and I use it to carry out a single classification of one image (“tell me whether this shows a pig, cow, or sheep”) — what does that actually mean, in terms of code?

I’ll go through an example in the form of a small hand-written C++ convolutional net that contains only the “execution” code, no training logic. It will be very much an illustrative implementation, not production code. It will use pre-trained data adapted from a similar model in Keras, on which more later.

The specific example I am using is a typical image classification one: identifying which of five types of flower is visible in a small colour image (inspired by this tutorial). All the code I’m describing can be found in this Github repository.

Networks, layers, weights, training

A network in this context is a pipeline of functions through which data is passed, with the output of each function going directly to the input of the next one. Each of these functions is known as a layer.

Our example will begin by supplying the image data as input to the first layer, and end with the final layer returning a set of five numbers, one for each kind of flower the network knows about, indicating how likely the network thinks it is that the image shows that kind of flower.

So to run one classification, we just have to get the image data into the right format and pass it through each layer function in turn, and out pops the network’s best guess.

The layers themselves perform some mathematical operation on the data. Some of them do so by making use of a set of other values, provided separately, which they might multiply with the inputs in some way (in which case they are known as weights) or add to the return values (known as biases).

Coming up with suitable weight and bias values for these layers is what training is for. To train a network, the weights and biases for each such layer are randomly initialised, then the network is repeatedly run to provide classifications of known inputs; each of the network’s guesses is compared with the known class of the input, and the weights and biases of the network are updated depending on how accurate the classification was. Eventually, with luck, they will converge on some reliable values. Training is both difficult and tedious, so here we will happily leave it to machine-learning researchers.

These trainable layers are then typically sandwiched by fixed layers that provide non-linearities (also known as activation functions) and various data-manipulation bits and bobs like max-pooling. More on these later.

The layers are pure functions: a layer with a given set of weights and biases will always produce the same output for a given input. This is not always true within recurrent networks, which model time-varying data by maintaining state in the layers, but it is true of classic convolutional nets.

Our example network

[diagram of the example network architecture]

This small network consists of four rounds of convolution + max-pooling layers, then two dense layers.

I constructed and trained this using Keras (in Python) based on a “flower pictures” dataset downloaded from the Tensorflow site, with 5 labelled classes of flowers. My Github repository contains a script obtain-data.sh which downloads and prepares the images, and a Python program with-keras/train.py which trains this network on them and exports the trained weights into a C++ file. The network isn’t a terribly good classifier, but that doesn’t bother me here.

I then re-implemented the model in C++ without using a machine-learning library. Here’s what the pipeline looks like. This can be found in the file flower.cpp in the repository. All of the functions called here are layer functions defined elsewhere in the same file, which I’ll go through in a moment. The variables with names beginning weights_ or biases_ are the trained values exported from Keras. Their definitions can be found in weights.cpp.

vector<float>
classify(const vector<vector<vector<float>>> &image)
{
    vector<vector<vector<float>>> t;

    t = zeroPad(image, 1, 1);
    t = convolve(t, weights_firstConv, biases_firstConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_secondConv, biases_secondConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_thirdConv, biases_thirdConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_fourthConv, biases_fourthConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    vector<float> flat = flatten(t);
    
    flat = dense(flat, weights_firstDense, biases_firstDense);
    flat = activation(flat, "relu");

    flat = dense(flat, weights_labeller, biases_labeller);
    flat = activation(flat, "softmax");

    return flat;
}

Many functions are used for more than one layer — for example we have a single convolve function used for four different layers, with different weights, bias values, and input shapes each time. This reuse is possible because layer functions don’t retain any state from one call to the next.

You can see that we pass the original image as input to the first layer, but subsequently we just take the return value from each function and pass it to the next one. These values have types that we are expressing as nested vectors of floats. They are all varieties of tensor, on which let me digress for a moment:

Tensors

For our purposes a tensor is a multidimensional array of numbers, in our case floats, of which vectors and matrices are special cases. A tensor has a shape, which we can write as a list of sizes, and the length of that list is the rank. (The word dimension can be ambiguous here and I’m going to try to avoid it… apart from that one time I used it a moment ago…)

Examples:

  • A matrix is a rank-2 tensor whose shape has two values, height and width. The matrix \begin{bmatrix}1.0&3.0&5.0\\2.0&4.0&6.0\end{bmatrix} has shape [2, 3], if we are giving the height first.
  • A C++ vector of floats is a rank-1 tensor, whose shape is a list of one value, the number of elements in the vector. The vector { 1.0, 2.0, 3.0 } has the shape [3].
  • A single number can also be considered a tensor, of rank 0, with shape [].

All of the values passed to and returned from our network layer functions are tensors of various shapes. For example, a colour image will be represented as a tensor of rank 3, as it has height, width, and a number of colour channels.

In this code, we are storing tensors using C++ vectors (for rank 1), vectors of vectors (for rank 2), and so on. This is transparent, makes indexing easy, and allows us to see the rank of each tensor directly from its type.
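As an illustrative sketch (the alias names here are mine, not taken from the repository), the nested-vector representation amounts to something like this:

#include <vector>

using std::vector;

typedef vector<float>   Tensor1;   // rank 1: shape [n]
typedef vector<Tensor1> Tensor2;   // rank 2: shape [height, width]
typedef vector<Tensor2> Tensor3;   // rank 3: shape [height, width, channels]

// e.g. a 128x128 RGB image in channels-last order, initialised to zeros:
Tensor3 image(128, Tensor2(128, Tensor1(3, 0.f)));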

Production C++ frameworks don’t do it this way — they typically store tensors as values interleaved into a single big array, wrapped up in a tensor class that knows how to index into it. With such a design, unlike our code, all tensors will have the same C++ class type regardless of their rank.

There is a practical issue with the ordering of tensor indices. Knowing the shape of a tensor is not enough: we also have to know which index is which. For example, an RGB image that is 128 pixels square has shape [128, 128, 3] if we index the height, width, and colour channels in that order, or [3, 128, 128] if we separate out the individual colour channels and index them first. Unfortunately, both layouts are in common use. Keras exposes the choice through a flag and uses the names channels_last for the former layout, historically the default in Tensorflow as it often runs faster on CPUs, and channels_first for the latter, used by many systems as it parallelises better with GPUs. The code in this example uses channels_last ordering.
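To make the difference concrete, here is how the red value of one pixel would be reached under each layout, using the nested-vector types sketched above (illustrative only):

// channels_last: image indexed as [height][width][channel]
float redLast(const Tensor3 &image, size_t y, size_t x) {
    return image[y][x][0];
}

// channels_first: image indexed as [channel][height][width]
float redFirst(const Tensor3 &image, size_t y, size_t x) {
    return image[0][y][x];
}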

This definition of a tensor is good enough for us, but in mathematics a tensor is not just an array of numbers but an object that sits in an algebraic space and can be manipulated in ways intrinsic to that space. Our data structures make no attempt to capture this, just as a C++ vector makes no effort to capture the properties of a mathematical vector space.

Layers used in our model

Each layer function takes a tensor as input, which I refer to within the function as in. Trained layers also take further tensors containing the weights and biases for the layer.

Convolution layer

Here’s the nub of this one, implemented in the convolve function.

for (size_t k = 0; k < nkernels; ++k) {
    for (size_t y = 0; y < out_height; ++y) {
        for (size_t x = 0; x < out_width; ++x) {
            for (size_t c = 0; c < depth; ++c) {
                for (size_t ky = 0; ky < kernel_height; ++ky) {
                    for (size_t kx = 0; kx < kernel_width; ++kx) {
                        out[y][x][k] +=
                            weights[ky][kx][c][k] *
                                in[y + ky][x + kx][c];
                    }
                }
            }
            out[y][x][k] += biases[k];
        }
    }
}

Seven nested for-loops! That’s a lot of brute force.

The weights for this layer, supplied in a rank-4 tensor, define a set of convolution kernels (or filters). Each kernel is a rank-3 tensor, width-height-depth. For each pixel in the input, and for each channel in the colour depth, the kernel values for that channel are multiplied by the surrounding pixel values, depending on the width and height of the kernel, and summed into a single value which appears at the same location in the output. We therefore get an output tensor containing a matrix of (almost) the same size as the input image, for each of the kernels.
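For orientation, the shapes involved look like this, and the output tensor might be sized as follows before the loops run (variable names follow the snippet above; the allocation itself is a sketch, not code lifted from the repository):

// in      : [in_height][in_width][depth]            (channels-last input)
// weights : [kernel_height][kernel_width][depth][nkernels]
// biases  : [nkernels]
// out     : [out_height][out_width][nkernels]

size_t out_height = in_height - kernel_height + 1;   // "valid" output size
size_t out_width  = in_width  - kernel_width  + 1;

vector<vector<vector<float>>> out
    (out_height,
     vector<vector<float>>(out_width, vector<float>(nkernels, 0.f)));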

(Note for audio programmers: FFT-based fast convolution is not generally used here, as it works out slower for small kernels.)

Vaguely speaking, this has the effect of transforming the input into a space determined by how much each pixel’s surroundings have in common with each of the learned kernel patterns, which presumably capture things like whether pixels have a common colour or whether there is common activity between a pixel and neighbouring pixels in a particular direction due to edges or lines in the image.

Usually the early convolution layers in an image classifier will explode the amount of data being passed through the network. In our small model, the first convolution layer takes in a 128×128 image with 3 colour channels (around 49,000 values) and returns a 128×128 output for each of 32 learned kernels (around 524,000 values): roughly ten times as much data, which will then be reduced gradually through later layers. But fewer weight values are needed to describe a convolution layer than the dense layers we’ll see later in our network, so the size of the trained layer (what you might call the “code size”) is comparatively small.

I said above that the output matrices were “almost” the same size as the input. They have to be a bit smaller, as otherwise the kernel would overrun the edges. To allow for this, we precede the convolution layer with a…

Zero-padding layer

This can be found in the zeroPad function. It creates a new tensor filled with zeros and copies the input into it:

for (size_t y = 0; y < in_height; ++y) {
    for (size_t x = 0; x < in_width; ++x) {
        for (size_t c = 0; c < depth; ++c) {
            out[y + pad_y][x + pad_x][c] = in[y][x][c];
        }
    }
}

All this does is add a zero-valued border around all four edges of the image, in each channel, in order to make the image sufficiently bigger to allow for the extent of the convolution kernel.

Most real-world implementations of convolution layers have an option to pad the input when they do the convolution, avoiding the need for an explicit zero-padding step. In Keras for example the convolution layers have a padding parameter that can be either valid (no extra padding) or same (provide padding so the output and input sizes match). But I’m trying to keep the individual layer functions minimal, so I’m happy to separate this out.
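As a sketch of the arithmetic involved (not code from the repository): for an odd-sized square kernel, “same”-style padding just means padding each edge by half the kernel extent. The zeroPad(t, 1, 1) calls in the pipeline above pad by one pixel on each edge, which is what you would use for 3×3 kernels.

// Padding per edge so that a following "valid" convolution gives an output
// with the same height and width as the original input:
size_t pad_y = (kernel_height - 1) / 2;   // 1 for a 3x3 kernel
size_t pad_x = (kernel_width  - 1) / 2;

size_t padded_height = in_height + 2 * pad_y;
size_t padded_width  = in_width  + 2 * pad_x;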

Activation layer

Again, this is something often carried out in the trained layers themselves, but I’ve kept it separate to simplify those.

This introduces a non-linearity by applying some simple mathematical function to each value (independently) in the tensor passed to it.

Historically, for networks without convolution layers, the activation function has usually been some kind of sigmoid function mapping a linear input onto an S-shaped curve, retaining small differences in a critical area of the input range and flattening them out elsewhere.

The activation function following a convolution layer is more likely to be a simple rectifier. This gives us the most disappointing acronym in all of machine learning: ReLU, which stands for “rectified linear unit” and means nothing more exciting than

if (x < 0.0) {
    x = 0.0;
}

applied to each value in the tensor.

We have a different activation function right at the end of the network: softmax. This is a normalising function used to produce final classification estimates that resemble probabilities summing to 1:

float sum = 0.f;
for (size_t i = 0; i < sz; ++i) {
    out[i] = exp(out[i]);
    sum += out[i];
}
if (sum != 0.f) {
    for (size_t i = 0; i < sz; ++i) {
        out[i] /= sum;
    }
}
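One refinement that is not in the code above, but is common in practice, is to subtract the largest value before exponentiating. The result is mathematically the same, but it avoids overflow when the inputs are large, and it makes the zero-sum check unnecessary because the largest term is always exp(0) = 1:

float maxval = out[0];
for (size_t i = 1; i < sz; ++i) {
    if (out[i] > maxval) maxval = out[i];
}
float sum = 0.f;
for (size_t i = 0; i < sz; ++i) {
    out[i] = exp(out[i] - maxval);
    sum += out[i];
}
for (size_t i = 0; i < sz; ++i) {
    out[i] /= sum;
}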

Max pooling layer

The initial convolution layer increases the amount of data in the network, and subsequent layers then reduce this into a smaller number of hopefully more meaningful higher-level values. Max pooling is one way this reduction happens. It reduces the resolution of its input in the height and width axes, by taking only the maximum of each NxM block of pixels, for some N and M which in our network are both always 2.

From the function maxPool:

for (size_t y = 0; y < out_height; ++y) {
    for (size_t x = 0; x < out_width; ++x) {
        for (size_t i = 0; i < pool_y; ++i) {
            for (size_t j = 0; j < pool_x; ++j) {
                for (size_t c = 0; c < depth; ++c) {
                    float value = in[y * pool_y + i][x * pool_x + j][c];
                    out[y][x][c] = max(out[y][x][c], value);
                }
            }
        }
    }
}

Flatten layer

Interleaves a rank-3 tensor into a rank-1 tensor, otherwise known as a single vector.

size_t i = 0;   // index into the flattened output vector
for (size_t y = 0; y < height; ++y) {
    for (size_t x = 0; x < width; ++x) {
        for (size_t c = 0; c < depth; ++c) {
            out[i++] = in[y][x][c];
        }
    }
}

Why? Because that’s what the following dense layer expects as its input. By this point we are saying that we have used as much of the structure in the input as we’re going to use, to produce a set of values that we now treat as individual features rather than as a structure.

With a library that represents tensors using an interleaved array, this will be a no-op, apart from changing the nominal rank of the tensor.

Dense layer

A dense or fully-connected layer is an old-school neural network layer. It consists of a single matrix multiplication, multiplying the input vector by the matrix of learned weights, then a vector addition of bias values. From our dense function:

vector<float> out(out_size, 0.f);

for (size_t i = 0; i < in_size; ++i) {
    for (size_t j = 0; j < out_size; ++j) {
        out[j] += weights[i][j] * in[i];
    }
}

for (size_t j = 0; j < out_size; ++j) {
    out[j] += biases[j];
}

Our network has two of these layers, the second of which produces the final prediction values.

That’s pretty much it for the code. There is a bit of boilerplate around each function, and some extra code to load in an image file using libpng, but that’s about all.
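To make the calling side concrete, a caller might look roughly like the following. The label list and the loadImage function are stand-ins of my own to keep the sketch short, not names taken from the repository.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// assumed to be defined elsewhere in the program:
vector<vector<vector<float>>> loadImage(const string &filename); // 128x128x3, channels-last
vector<float> classify(const vector<vector<vector<float>>> &image);

int main()
{
    // illustrative class names only; the real order comes from the training run
    vector<string> labels { "daisy", "dandelion", "rose", "sunflower", "tulip" };

    vector<vector<vector<float>>> image = loadImage("flower.png");
    vector<float> scores = classify(image);

    size_t best = distance(scores.begin(),
                           max_element(scores.begin(), scores.end()));
    cout << labels[best] << " (score " << scores[best] << ")" << endl;
    return 0;
}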

There is one other kind of layer that appears in the original Keras model: a dropout layer. This doesn’t appear in our code because it is only used while training. It discards a certain proportion of the values passed to it as input, something that can help to make the following layer more robust to error when trained.

Further notes

Precision and repeatability

Training a network is a process that involves randomness. A given model architecture can produce quite different results on separate training runs.

This is not the case when running a trained network to make a classification or prediction. Two implementations of a network that both use the same basic arithmetic datatypes (e.g. 32-bit floats) should produce the same results when given the same set of trained weights and biases. If you are trying to reproduce a network’s behaviour using a different library or framework, and you have the same trained weights to use with both implementations, and your results look only “sort of” similar, then they’re probably wrong.

The differences between implementations in terms of channel ordering and indexing can be problematic in this respect. Consider the convolution function with its seven nested for-loops. Load any of those weights with the wrong index or in the wrong order and of course it won’t work, and there are a lot of possible permutations there. There seems to be more than one opinion about how kernel weights should be indexed, for example, and it’s easy to get them upside-down if you’re converting from one framework to another.
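As an illustration of the kind of conversion involved (a sketch, not the actual code in the with-tiny-dnn directory), moving a kernel-weights tensor from a channels-last layout to one channels-first layout is just a relabelling of indices, and getting any one of them in the wrong order quietly breaks the whole network:

// weightsLast  indexed as [ky][kx][c][k], as in the convolve function above
// weightsFirst indexed as [k][c][ky][kx], one possible channels-first layout
for (size_t ky = 0; ky < kernel_height; ++ky) {
    for (size_t kx = 0; kx < kernel_width; ++kx) {
        for (size_t c = 0; c < depth; ++c) {
            for (size_t k = 0; k < nkernels; ++k) {
                weightsFirst[k][c][ky][kx] = weightsLast[ky][kx][c][k];
            }
        }
    }
}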

Should I use code like this in production?

Probably not!

First and most trivially, this isn’t a very efficient arrangement of layers. Separating out the zero-padding and activation functions means more memory allocation and copying.

Second, there are many ways to speed up the expensive brute-force layers (that is, the convolution and dense layers) dramatically even on the CPU, using vectorisation, careful cache and memory management, and faster matrix and convolution algorithms. If you don’t take advantage of these, you’re wasting time for no purpose; if you do, you’re repeating work other people are doing in libraries already.

I adapted the example network to use the tiny-dnn header-only C++ library, and it ran roughly twice as fast as the code above. (I’m a sucker for libraries named tiny-something, even if, as in this case, they are no longer all that tiny.) That version of the example code can be found in the with-tiny-dnn directory in the repo. It’s ugly, because I had to load and convert the weights from a system that uses channels-last format into one that expects channels-first. But I only had to write that code once, and it wouldn’t have been necessary if I’d trained the model in a channels-first layout. I might also have overlooked some simpler way to do it.

Third, this approach is fragile in the face of changes to the model architecture, which have to be duplicated exactly across the model implementation and the separately-loaded weights. It would be preferable to load the model architecture into your program, rather than load the weights into a hand-written architecture. To this end it seems to make sense to use a library that supports importing and exporting models automatically.

All this said, I find it highly liberating to realise that one could just sketch out these few lines of code and have a working result.

 

Notes on Idris


Idris Book cover

In a bid to expand my programming brain by learning something about “dependent types”, I recently bought the Idris book.

(Idris is a pure functional programming language that is mostly known for supporting dependent types. Not knowing what that really meant, and seeing that this recently-published book written by the author of the language was warmly reviewed on Amazon, I saw an opportunity.)

The idea behind dependent typing is to allow types to be declared as having dependencies on information that would, in most languages, only be known at runtime. For example, a function might accept two arrays as arguments, but only work correctly if the two arrays that are actually passed to it have the same length: with dependent types, this dependency can be written into the type declaration and checked at compile time. On the face of it we can’t in general know things like the number of elements in an arbitrary array at compile time, so this seems like compelling magic. Types are, in fact, first class values, and if that raises a lot of horribly recursive-seeming questions in your mind, then this book might help.

A number of academic languages are exploring this area, but Idris is interesting because of its ambition to be useful for general-purpose programming. The book is pleasingly straightforward about this: it adopts an attitude that “this is how we program, and I’ll help you do it”. In fact, the Idris book might be the most inspiring programming book I’ve read since ML for the Working Programmer (although the two are very different, more so than their synopses would suggest). It’s not that the ideas in it are new — although they are, to me — but that it presents them so crisply as to project a tantalising image of a better practice of programming.

What the book suggests

The principles in the book — as I read it — are:

  • We always write down the type of a function before writing the function. Because the type system is so expressive, this gives the compiler, and the programmer, an awful lot of information — sometimes so much that there remains only one obvious way to satisfy the type requirements when writing the rest of the function. (The parallel with test-driven development, which can be a useful thinking aid when you aren’t sure how to address a problem, presumably inspired the title of the book.)
  • Sometimes, we don’t even implement the rest of the function yet. Idris supports a lovely feature called “holes”, which are just names you plonk down in place of code you haven’t got around to writing yet. The compiler will happily type-check the rest of the program as if the holes were really there, and then report the holes and their types to you to give you a checklist of the bits you still need to fill in.
  • Functions are pure, as in they simply convert their input values to their return values and don’t modify any other state as they go along, and are if possible total, meaning that every input has a corresponding return value and the function won’t bail out or enter an infinite loop. The compiler will actually check that your functions are total — the Halting Problem shows that this can’t be done in general, but apparently it can be done enough of the time to be useful. The prospect of having a compiler confirm that your program cannot crash is exciting even to someone used to Standard ML, where typing is generally sound but non-exhaustive matches and range errors still happen.
  • Because functions are pure, all input and output uses monads. That means that I/O effects are encapsulated into function return types. The book gives a supremely matter-of-fact introduction to monadic I/O, without using the word monad.
  • We use an IDE which supports all this with handy shortcuts — there is an impressive Idris integration for Atom which is used extensively throughout the book. For an Emacs user like me this is slightly offputting. There is an Emacs Idris mode as well, it’s just that so much of the book is expressed in terms of keystrokes in Atom.

None of these principles refers to dependent types at all, but the strength of the type system is what makes it plausible that this could all work. Idris does a lot of evaluation at compile time, to support the type system and totality checking, so you always have the feeling of having run your program a few times before it has even compiled.

So, I enjoyed the book. To return to earth-based languages for a moment, it reminds me a bit of Bjarne Stroustrup’s “A Tour of C++”, which successfully sets out “modern C++” as if most of the awkward history of C++ had never happened. One could see this book as setting out a “modern Haskell” as if the actual Haskell had not happened. Idris is quite closely based on Haskell, and I think, as someone who has never been a Haskell programmer, that much of the syntax and prelude library are the same. The biggest difference is that Idris uses eager evaluation, where Haskell is lazily evaluated. (See 10 things Idris improved over Haskell for what appears to be an informed point of view.)

Then what happened?

How did I actually get on with it? After absorbing some of this, I set out to write my usual test case (see Four MLs and a Python) of reading a CSV file of numbers and adding them up. The goal is to read a text file, split each line into number columns, sum the columns across all lines, then print out a single line of comma-separated sums — and to do it in a streaming fashion without reading the whole file into memory.

Here’s my first cut at it. Note that this doesn’t actually use dependent types at all.

module Main

import Data.String

parseFields : List String -> Maybe (List Double)
parseFields strs =
  foldr (\str, acc => case (parseDouble str, acc) of
                           (Just d, Just ds) => Just (d :: ds)
                           _ => Nothing)
        (Just []) strs

parseLine : String -> Maybe (List Double)
parseLine str =
  parseFields $ Strings.split (== ',') str

sumFromFile : List Double -> File -> IO (Either String (List Double))
sumFromFile xs f =
  do False <- fEOF f
     | True => pure (Right xs)
     Right line <- fGetLine f
     | Left err => pure (Left "Failed to read line from file")
     if line == ""
     then sumFromFile xs f
     else case (xs, parseLine line) of
               ([],  Just xs2) => sumFromFile xs2 f
               (xs1, Just xs2) => if length xs1 == length xs2
                                  then sumFromFile (zipWith (+) xs1 xs2) f
                                  else pure (Left "Inconsistent-length rows")
               (_, Nothing) => pure (Left $ "Failed to parse line: " ++ line)

sumFromFileName : String -> IO (Either String (List Double))
sumFromFileName filename =
  do Right f <- openFile filename Read
     | Left err => pure (Left "Failed to open file")
     sumFromFile [] f

main : IO ()
main =
  do [_, filename] <- getArgs
     | _ => putStrLn "Exactly 1 filename must be given"
     Right result <- sumFromFileName filename
     | Left err => putStrLn err
     putStrLn (pack $ intercalate [','] $ map (unpack . show) $ result)

Not beautiful, and I had some problems getting this far. The first thing is that compile times are very, very slow. As soon as any I/O got involved, a simple program took 25 seconds or more to build. Surprisingly, almost all of the time was used by gcc, compiling the C output of the Idris compiler. The Idris type checker itself was fast enough that this at least didn’t affect the editing cycle too much.

The combination of monadic I/O and eager evaluation was also a problem for me. My test program needs to process lines from a file one at a time, without reading the whole file first. In an impure language like SML, this is easy: you can read a line in any function, without involving any I/O in the type signature of the function. In a pure language, you can only read in an I/O context. In a pure lazy language, you can read once in an I/O context and get back a lazily-evaluated list of all the lines in a file (Haskell has a prelude function called lines that does this), which makes toy examples like this simple without having to engage properly with your monad. In a pure eager language, it isn’t so simple: you have to actually “do the I/O” and permeate the I/O context through the program.

This conceptual difficulty was compounded by the book’s caginess about how to read from a file. It’s full of examples that read from the terminal, but the word “file” isn’t in the index. I have the impression this might be because the “right way” is to use a higher-level stream interface, but that interface maybe wasn’t settled enough to go into a textbook.

(As a non-Haskell programmer I find it hard to warm to the “do” syntax used as sugar for monadic I/O in Haskell and Idris. It is extremely ingenious, especially the way it allows alternation — see the lines starting with a pipe character in the above listing — as shorthand for handling error cases. But it troubles me, reminding me of working with languages that layered procedural sugar over Lisp, like the RLisp that the REDUCE algebra system was largely written in. Robert Harper once described Haskell as “the world’s best imperative programming language” and I can see where that comes from: if you spend all your time in I/O context, you don’t feel like you’re writing functional code at all.)

Anyway, let’s follow up my clumsy first attempt with a clumsy second attempt. This one makes use of what seems to be the “hello, world” of dependent typing, namely vectors whose length is part of their compile type. With this, I guess we can guarantee that the lowest-level functions, such as summing two vectors, are only called with the proper length arguments. That doesn’t achieve much for our program since we’re dealing directly with arbitrary-length user inputs anyway, but I can see it could be useful to bubble up error handling out of library code.

Here’s that second effort:

module Main

import Data.Vect
import Data.String

total
sumVects : Vect len Double -> Vect len Double -> Vect len Double
sumVects v1 v2 = 
  zipWith (+) v1 v2

total
parseFields : List String -> Maybe (List Double)
parseFields strs =
  foldr (\str, acc => case (parseDouble str, acc) of
                           (Just d, Just ds) => Just (d :: ds)
                           _ => Nothing)
        (Just []) strs

total
parseVect : String -> Maybe (len ** Vect len Double)
parseVect str =
  case parseFields $ Strings.split (== ',') str of
  Nothing => Nothing
  Just xs => Just (_ ** fromList xs)  

sumFromFile : Maybe (len ** Vect len Double) -> File -> 
              IO (Either String (len ** Vect len Double))
sumFromFile acc f =
  do False <- fEOF f
     | True => case acc of
               Nothing => pure (Right (_ ** []))
               Just v => pure (Right v)
     Right line <- fGetLine f
     | Left err => pure (Left "Failed to read line from file")
     if line == ""
     then sumFromFile acc f
     else case (acc, parseVect line) of
               (_, Nothing) => pure (Left $ "Failed to parse line: " ++ line)
               (Nothing, other) => sumFromFile other f
               (Just (len ** xs), Just (len' ** xs')) =>
                    case exactLength len xs' of
                         Nothing => pure (Left "Inconsistent-length rows")
                         Just xs' =>
                             sumFromFile (Just (len ** (sumVects xs xs'))) f

sumFromFileName : String -> IO (Either String (len ** Vect len Double))
sumFromFileName filename =
  do Right f <- openFile filename Read
     | Left err => pure (Left "Failed to open file")
     sumFromFile Nothing f

main : IO ()
main =
  do [_, filename] <- getArgs
     | _ => putStrLn "Exactly 1 filename must be given"
     Right (_ ** result) <- sumFromFileName filename
     | Left err => putStrLn err
     putStrLn (pack $ intercalate [','] $ map (unpack . show) $ toList result)

Horrible. This is complicated and laborious, at least twice as long as it should be, and makes me feel as if I’ve missed something essential about the nature of the problem. You can see that, as a programmer, I’m struggling a surprising amount here.

Both of these examples compile, run, and get the right answers. But when I tried them on a big input file, both versions took more than 8 minutes to process it — on a file that was processed in less than 30 seconds by each of the languages I tried out in for this post. A sampling profiler did not pick out any obvious single source of delay. Something strange is going on here.

All in all, not a wild success. That’s a pity, because:

Good Things

My failure to produce a nice program that worked efficiently didn’t entirely burst the bubble. I liked a lot in my first attempt to use Idris.

The language has a gloriously clean syntax. I know my examples show it in a poor light; compare instead this astonishingly simple typesafe implementation of a printf-like string formatting function, where format arguments are typechecked based on the content of the format string, something you can’t do at all in most languages. Idris tidies up a number of things from Haskell, and has a beautifully regular syntax for naming types.

Idris is also nice and straightforward, compared with other strongly statically-typed languages, when it comes to things like using arithmetic operators for different kinds of number, or mechanisms like “map” over different types. I assume this is down to the pervasive use of what in Haskell are known as type classes (here called interfaces).

I didn’t run into any problems using the compiler and tools. They were easy to obtain and run, and the compiler (version 1.1.1) gives very clear error messages considering how sophisticated some of the checks it does are. The compiler generally “feels” solid.

I’m not sold on the fact that these languages are whitespace-sensitive — life’s too short to indent code by hand — though this does contribute to the tidy syntax. Idris seems almost pathologically sensitive to indentation level, and a single space too many when indenting a “do” block can leave the compiler lost. But this turned out to be less of a problem in practice than I had expected.

Will I be using Idris for anything bigger? Not yet, but I’d like to get better at it. I’m going to keep an eye on it and drop in occasionally to try another introductory problem or two. Meanwhile if you’d like to improve on anything I’ve written here, please do post below.

MIREX 2017 submissions

For the fifth year in a row, the Centre for Digital Music has submitted a number of Vamp audio analysis plugins to the MIREX evaluation for “music information retrieval” tasks. This year we submitted the same set of plugins as last year; there were no new implementations, and some of the existing ones are so old as to have celebrated their tenth birthday earlier in the year. So the goal is not to provide state-of-the-art results, but to give other methods a stable baseline for comparison and to check each year’s evaluation metrics and datasets against neighbouring years. I’ve written about this in each of the four previous years: see posts about 2016, 2015, 2014, and 2013.

Obviously, having submitted exactly the same plugins as last year, we expect basically the same results. But the other entries we’re up against will have changed, so here’s a review of how each category went.

(Note: we dropped one category this year, Audio Downbeat Estimation. Last year’s submission was not well prepared for reasons I touched on in last year’s post, and I didn’t find time to rework it.)

Structural Segmentation

Results for the four datasets are here, here, here, and here. Our results, for Segmentino from Matthias Mauch and the older QM Segmenter from Mark Levy, were the same as last year, with the caveat that the QM Segmenter uses random initialisation and so never gets exactly the same results twice.

Surprisingly, nobody else entered anything to this category this year, which seems a pity because it’s an interesting problem. This category seems to have peaked around 2012-2013.

Multiple Fundamental Frequency Estimation and Tracking

An exciting year for this mind-bogglingly difficult category, with 14 entries from ten different sets of authors and a straight fight between template decomposition methods (including our Silvet plugin, from Emmanouil Benetos’s work) and trendy convolutional neural networks. Results are here and here.

With so many entries and evaluations it’s not that easy to get a clear picture, and no single method appears to be overwhelmingly strong. There were fine results in some evaluations for CNN methods from Thickstun et al and Thomé and Ahlbäck, for Pogorelyuk and Rowley‘s very intriguing “Dynamic Mode Decomposition”, and for a few others whose abstracts are missing from the entry site and so can’t be linked to.

Silvet, with the same results as last year, does well enough to be interesting, but in most cases it isn’t troubling the best of the newer methods.

Audio Onset Detection

Bit of a puzzle here, as our two plugin submissions both got slightly different results from last year despite being unchanged implementations of deterministic methods invoked in the same way on the same data sets.

Last year saw a big expansion in the number of entries, and this year there were nearly as many. Just as last year, our old plugins did modestly, but again some of the new experiments fared a bit less well so we weren’t quite at the bottom. Results here.

Audio Beat Tracking

Same puzzle as in onset detection: while our results were basically similar to last year, they weren’t identical. The 2015 and 2016 results were identical and we would have expected the same again in 2017.

That apart, there’s little to report since last year. Results are here, here, and here.

Audio Tempo Estimation

Last year there were two entries in this category, ours and a much stronger one from Sebastian Böck. This year sees one addition, from Hendrik Schreiber and Meinard Müller, which fares creditably. The results are here.

Audio Key Detection

Two pretty successful new submissions this year, both using convolutional neural networks: one from Korzeniowski, Böck, Krebs and Widmer, and the other from Hendrik Schreiber. Our old plugin (from work by Katy Noland) does not fare tragically, but it’s clear that some other methods are getting much closer to the sort of performance one imagines should be realistic. The results are linked from here.

Intuitively, key estimation seems like the sort of problem that is interesting only so long as you don’t have enough training data. As a 24-way classification with large enough training datasets, it looks a bit mundane. The problem becomes, what does it mean for a piece of music to be in a particular key anyway? Submissions are not expected to answer that, but presumably it sets an upper bound on performance.

Audio Chord Estimation

Another increase in the number of test datasets, from 5 to 7, and a strong category again. Last year our submission Chordino (by Matthias Mauch) was beginning to trail, though it wasn’t quite at the back. This year some of the weaker submissions have not been repeated, some new entries have appeared, and Chordino is in last place for every evaluation. It’s not far behind — perceptually it’s still a pretty good algorithm — but some of the other methods are very impressive now. Here are the results.

The abstracts accompanying the two submissions from the audio information processing group at Fudan University in Shanghai (Jiang, Li and Wu and Wu, Feng and Li) are both well worth a read. The former paper refers closely to Chordino, using the same NNLS Chroma features with a new front-end. Meanwhile, the latter paper proposes a method worth remembering for dinner parties, using deep residual networks trained from MIDI-synchronised constant-Q representations of audio with a bidirectional long-short-term memory and conditional random field for labelling.