# A film camera

I take a lot of photos and I share some of them online via the antique medium of Flickr. Not many people look at them, which I don’t mind, because I imagine my audience to be (a) family and (b) myself, later. Photos I take with people in them are usually visible only to my friends and family. I’m a person who takes photographs, not a photographer.

But I do take some joy in the practice of photography. That’s partly because I can: at my level, there is very little to it: somebody else has done all the hard work. There is massive, long-term, highly technically sophisticated labour behind every functional detail of image capture and reproduction, which all culminates, for routine takers-of-photos like me, in pressing a button or tapping the screen and deciding whether you like the resulting image or not. It’s a ritual that has delivered a spurious feeling of creativity to people for decades, prefiguring the internet age.

There are four categories of potential joy in a photo, and they go in this order:

1. Looking at whatever it is you’re thinking of taking a photo of
2. Solving technical problems, or just fiddling with the camera
3. Enjoying the picture itself
4. Finding the pic again later and reminiscing

Obviously, photos of your friends in the pub can skip categories 1 (except in a social sense) and 2. Very deliberate landscape photos might have a lot of categories 1, 2, and 3 but not a great deal of category 4. Please understand that I am vaguely blathering about this category thing because it seemed to make sense while I was typing this, not because I think it’s any kind of real system.

I started out taking photos on film, then moved to a digital camera in 2003. By then a digital camera already gave you more pleasure in the likely quality and serendipity of your pictures, good for categories 3 and 4 above. I did keep a film SLR — a Zenit EM, the cheapest second-hand SLR available when I bought it in 1991 — but it’s very clumsy to use, and the category-2 joy you would imagine getting from it never really materialised.

In search of that sort of joy, I recently bought a slightly fancier second-hand film camera from someone on eBay. I wanted something still mostly manual, but more like the kind of thing I never had access to when young. I decided to buy a Minolta, and I’m not ashamed to say that was mostly because I like the old Minolta logo, before they introduced the more familiar Saul Bass-designed all-caps logotype in 1981. The older logo is verging on Comic Sans in its friendliness, and gives the camera a sweet face:

This manual-focus, manual-aperture, automatic exposure Minolta XG-9 dates from about 1980. It was the cheaper of Minolta’s two SLR ranges at the time; the body probably cost slightly more than a 48K ZX Spectrum home computer. It would have been a nice and very practical camera. It also embodies a mind-boggling amount of mechanical complexity compared with modern equipment. To the right is a schematic of the winder mechanism, one of dozens of such illustrations found in the service manual.

It was sold to me with a lens at least a decade older than the camera, a splendid-looking chunk of metal with an impressive (and apparently rather 1960s) wide front glass element.

Anyway, I bought this for my category 2, the fiddling. But so far what I’ve appreciated most is the thing I gave as category 1: just looking at the real scene more closely. Without a screen to review photos on, you have to assume that your photo will fail and you will never get to see this again. If an image should appear again, days later when you get the film processed, it’s a fresh delight.

The downside is that the economics of successful photos still apply. That is, only one in ten shots is any good. With a smartphone or digital camera you can take a hundred photos and have ten you really like. With film you buy a 36-photo roll, get three or four decent photos, but have to visit the print shop twice and spend at least £15 buying and developing the film. (Though I was surprised to find that you can still get film processed at Snappy Snaps.)

And how are the photos? Well, it’s a bit like listening to vinyl records. It’s nice for things that benefit from a bit of roughness and vigour, like this kind of thing:

(That one wanted a lens hood to prevent the flares in the middle and right, but I didn’t have one at the time.)

Or for snaps of people:

I like both of those a lot, but I’ve yet to get any really successful landscape or “still-life” pics from it and I suspect I never will, now that I’m used to a cleaner, higher resolution digital image.

Will I be using it much? Am I going to carry it around everywhere, but take far fewer and more selective photos than I otherwise might? Probably not, but it might not be up to me anyway. These are fairly solid cameras, but this one is nearly 40 years old and has a few electronic bits as well as sensitive mechanical parts. They do fail in various ways and I don’t entirely trust that it’s going to be still working the next time I want to use it. That primitive but sturdy Zenit will probably have the last laugh.

# What does a convolutional neural net actually do when you run it?

Convolutional neural networks (or convnets or CNNs) are a staple of “deep learning”. There are many tutorials available that describe what they do, either mathematically or via quasi-mystical appeals to intuition, and introduce how to train and use them, often with image classification examples.

This post has a narrower focus. As a programmer, I am happy processing floating-point data in languages like C++, but I’m more likely to be building applications with pre-trained models than training new ones. Say I have a pre-trained convnet image classifier, and I use it to carry out a single classification of one image (“tell me whether this shows a pig, cow, or sheep”) — what does that actually mean, in terms of code?

I’ll go through an example in the form of a small hand-written C++ convolutional net that contains only the “execution” code, no training logic. It will be very much an illustrative implementation, not production code. It will use pre-trained data adapted from a similar model in Keras, on which more later.

The specific example I am using is a typical image classification one: identifying which of five types of flower is visible in a small colour image (inspired by this tutorial). All the code I’m describing can be found in this Github repository.

### Networks, layers, weights, training

A network in this context is a pipeline of functions through which data is passed, with the output of each function going directly to the input of the next one. Each of these functions is known as a layer.

Our example will begin by supplying the image data as input to the first layer, and end with the final layer returning a set of five numbers, one for each kind of flower the network knows about, indicating how likely the network thinks it is that the image shows that kind of flower.

So to run one classification, we just have to get the image data into the right format and pass it through each layer function in turn, and out pops the network’s best guess.

The layers themselves perform some mathematical operation on the data. Some of them do so by making use of a set of other values, provided separately, which they might multiply with the inputs in some way (in which case they are known as weights) or add to the return values (known as biases).

Coming up with suitable weight and bias values for these layers is what training is for. To train a network, the weights and biases for each such layer are randomly initialised, then the network is repeatedly run to provide classifications of known inputs; each of the network’s guesses is compared with the known class of the input, and the weights and biases of the network are updated depending on how accurate the classification was. Eventually, with luck, they will converge on some reliable values. Training is both difficult and tedious, so here we will happily leave it to machine-learning researchers.

These trainable layers are then typically sandwiched by fixed layers that provide non-linearities (also known as activation functions) and various data-manipulation bits and bobs like max-pooling. More on these later.

The layers are pure functions: a layer with a given set of weights and biases will always produce the same output for a given input. This is not always true within recurrent networks, which model time-varying data by maintaining state in the layers, but it is true of classic convolutional nets.

### Our example network

This small network consists of four rounds of convolution + max-pooling layers, then two dense layers.

I constructed and trained this using Keras (in Python) based on a “flower pictures” dataset downloaded from the Tensorflow site, with 5 labelled classes of flowers. My Github repository contains a script obtain-data.sh which downloads and prepares the images, and a Python program with-keras/train.py which trains this network on them and exports the trained weights into a C++ file. The network isn’t a terribly good classifier, but that doesn’t bother me here.

I then re-implemented the model in C++ without using a machine-learning library. Here’s what the pipeline looks like. This can be found in the file flower.cpp in the repository. All of the functions called here are layer functions defined elsewhere in the same file, which I’ll go through in a moment. The variables with names beginning weights_ or biases_ are the trained values exported from Keras. Their definitions can be found in weights.cpp.

vector<float>
classify(const vector<vector<vector<float>>> &image)
{
vector<vector<vector<float>>> t;

t = convolve(t, weights_firstConv, biases_firstConv);
t = activation(t, "relu");
t = maxPool(t, 2, 2);

t = convolve(t, weights_secondConv, biases_secondConv);
t = activation(t, "relu");
t = maxPool(t, 2, 2);

t = convolve(t, weights_thirdConv, biases_thirdConv);
t = activation(t, "relu");
t = maxPool(t, 2, 2);

t = convolve(t, weights_fourthConv, biases_fourthConv);
t = activation(t, "relu");
t = maxPool(t, 2, 2);

vector<float> flat = flatten(t);

flat = dense(flat, weights_firstDense, biases_firstDense);
flat = activation(flat, "relu");

flat = dense(flat, weights_labeller, biases_labeller);
flat = activation(flat, "softmax");

return flat;
}


Many functions are used for more than one layer — for example we have a single convolve function used for four different layers, with different weights, bias values, and input shapes each time. This reuse is possible because layer functions don’t retain any state from one call to the next.

You can see that we pass the original image as input to the first layer, but subsequently we just take the return value from each function and pass it to the next one. These values have types that we are expressing as nested vectors of floats. They are all varieties of tensor, on which let me digress for a moment:

### Tensors

For our purposes a tensor is a multidimensional array of numbers, in our case floats, of which vectors and matrices are special cases. A tensor has a shape, which we can write as a list of sizes, and the length of that list is the rank. (The word dimension can be ambiguous here and I’m going to try to avoid it… apart from that one time I used it a moment ago…)

Examples:

• A matrix is a rank-2 tensor whose shape has two values, height and width. The matrix $\begin{bmatrix}1.0&3.0&5.0\\2.0&4.0&6.0\end{bmatrix}$ has shape [2, 3], if we are giving the height first.
• A C++ vector of floats is a rank-1 tensor, whose shape is a list of one value, the number of elements in the vector. The vector { 1.0, 2.0, 3.0 } has the shape [3].
• A single number can also be considered a tensor, of rank 0, with shape [].

All of the values passed to and returned from our network layer functions are tensors of various shapes. For example, a colour image will be represented as a tensor of rank 3, as it has height, width, and a number of colour channels.

In this code, we are storing tensors using C++ vectors (for rank 1), vectors of vectors (for rank 2), and so on. This is transparent, makes indexing easy, and allows us to see the rank of each tensor directly from its type.

Production C++ frameworks don’t do it this way — they typically store tensors as values interleaved into a single big array, wrapped up in a tensor class that knows how to index into it. With such a design, unlike our code, all tensors will have the same C++ class type regardless of their rank.

There is a practical issue with the ordering of tensor indices. Knowing the shape of a tensor is not enough: we also have to know which index is which. For example, an RGB image that is 128 pixels square has shape [128, 128, 3] if we index the height, width, and colour channels in that order, or [3, 128, 128] if we separate out the individual colour channels and index them first. Unfortunately, both layouts are in common use. Keras exposes the choice through a flag and uses the names channels_last for the former layout, historically the default in Tensorflow as it often runs faster on CPUs, and channels_first for the latter, used by many systems as it parallelises better with GPUs. The code in this example uses channels_last ordering.

This definition of a tensor is good enough for us, but in mathematics a tensor is not just an array of numbers but an object that sits in an algebraic space and can be manipulated in ways intrinsic to that space. Our data structures make no attempt to capture this, just as a C++ vector makes no effort to capture the properties of a mathematical vector space.

### Layers used in our model

Each layer function takes a tensor as input, which I refer to within the function as in. Trained layers also take further tensors containing the weights and biases for the layer.

#### Convolution layer

Here’s the nub of this one, implemented in the convolve function.

for (size_t k = 0; k < nkernels; ++k) {
for (size_t y = 0; y < out_height; ++y) {
for (size_t x = 0; x < out_width; ++x) {
for (size_t c = 0; c < depth; ++c) {
for (size_t ky = 0; ky < kernel_height; ++ky) {
for (size_t kx = 0; kx < kernel_width; ++kx) {
out[y][x][k] +=
weights[ky][kx][c][k] *
in[y + ky][x + kx][c];
}
}
}
out[y][x][k] += biases[k];
}
}
}


Seven nested for-loops! That’s a lot of brute force.

The weights for this layer, supplied in a rank-4 tensor, define a set of convolution kernels (or filters). Each kernel is a rank-3 tensor, width-height-depth. For each pixel in the input, and for each channel in the colour depth, the kernel values for that channel are multiplied by the surrounding pixel values, depending on the width and height of the kernel, and summed into a single value which appears at the same location in the output. We therefore get an output tensor containing a matrix of (almost) the same size as the input image, for each of the kernels.

(Note for audio programmers: FFT-based fast convolution is not generally used here, as it works out slower for small kernels)

Vaguely speaking, this has the effect of transforming the input into a space determined by how much each pixel’s surroundings have in common with each of the learned kernel patterns, which presumably capture things like whether pixels have a common colour or whether there is common activity between a pixel and neighbouring pixels in a particular direction due to edges or lines in the image.

Usually the early convolution layers in an image classifier will explode the amount of data being passed through the network. In our small model, the first convolution layer takes in a 128×128 image with 3 colour channels, and returns a 128×128 output for each of 32 learned kernels: nearly ten times as much data, which will then be reduced gradually through later layers. But fewer weight values are needed to describe a convolution layer compared with the dense layers we’ll see later in our network, so the size of the trained layer (what you might call the “code size”) is relatively smaller.

I said above that the output matrices were “almost” the same size as the input. They have to be a bit smaller, as otherwise the kernel would overrun the edges. To allow for this, we precede the convolution layer with a…

This can be found in the zeroPad function. It creates a new tensor filled with zeros and copies the input into it:

for (size_t y = 0; y < in_height; ++y) {
for (size_t x = 0; x < in_width; ++x) {
for (size_t c = 0; c < depth; ++c) {
}
}
}


All this does is add a zero-valued border around all four edges of the image, in each channel, in order to make the image sufficiently bigger to allow for the extent of the convolution kernel.

Most real-world implementations of convolution layers have an option to pad the input when they do the convolution, avoiding the need for an explicit zero-padding step. In Keras for example the convolution layers have a padding parameter that can be either valid (no extra padding) or same (provide padding so the output and input sizes match). But I’m trying to keep the individual layer functions minimal, so I’m happy to separate this out.

#### Activation layer

Again, this is something often carried out in the trained layers themselves, but I’ve kept it separate to simplify those.

This introduces a non-linearity by applying some simple mathematical function to each value (independently) in the tensor passed to it.

Historically, for networks without convolution layers, the activation function has usually been some kind of sigmoid function mapping a linear input onto an S-shaped curve, retaining small differences in a critical area of the input range and flattening them out elsewhere.

The activation function following a convolution layer is more likely to be a simple rectifier. This gives us the most disappointing acronym in all of machine learning: ReLU, which stands for “rectified linear unit” and means nothing more exciting than

if (x < 0.0) {
x = 0.0;
}


applied to each value in the tensor.

We have a different activation function right at the end of the network: softmax. This is a normalising function used to produce final classification estimates that resemble probabilities summing to 1:

float sum = 0.f;
for (size_t i = 0; i < sz; ++i) {
out[i] = exp(out[i]);
sum += out[i];
}
if (sum != 0.f) {
for (size_t i = 0; i < sz; ++i) {
out[i] /= sum;
}
}


#### Max pooling layer

The initial convolution layer increases the amount of data in the network, and subsequent layers then reduce this into a smaller number of hopefully more meaningful higher-level values. Max pooling is one way this reduction happens. It reduces the resolution of its input in the height and width axes, by taking only the maximum of each NxM block of pixels, for some N and M which in our network are both always 2.

From the function maxPool:

for (size_t y = 0; y < out_height; ++y) {
for (size_t x = 0; x < out_width; ++x) {
for (size_t i = 0; i < pool_y; ++i) {
for (size_t j = 0; j < pool_x; ++j) {
for (size_t c = 0; c < depth; ++c) {
float value = in[y * pool_y + i][x * pool_x + j][c];
out[y][x][c] = max(out[y][x][c], value);
}
}
}
}
}


#### Flatten layer

Interleaves a rank-3 tensor into a rank-1 tensor, otherwise known as a single vector.

for (size_t y = 0; y < height; ++y) {
for (size_t x = 0; x < width; ++x) {
for (size_t c = 0; c < depth; ++c) {
out[i++] = in[y][x][c];
}
}
}


Why? Because that’s what the following dense layer expects as its input. By this point we are saying that we have used as much of the structure in the input as we’re going to use, to produce a set of values that we now treat as individual features rather than as a structure.

With a library that represents tensors using an interleaved array, this will be a no-op, apart from changing the nominal rank of the tensor.

#### Dense layer

A dense or fully-connected layer is an old-school neural network layer. It consists of a single matrix multiplication, multiplying the input vector by the matrix of learned weights, then a vector addition of bias values. From our dense function:

vector<float> out(out_size, 0.f);

for (size_t i = 0; i < in_size; ++i) {
for (size_t j = 0; j < out_size; ++j) {
out[j] += weights[i][j] * in[i];
}
}

for (size_t j = 0; j < out_size; ++j) {
out[j] += biases[j];
}


Our network has two of these layers, the second of which produces the final prediction values.

That’s pretty much it for the code. There is a bit of boilerplate around each function, and some extra code to load in an image file using libpng, but that’s about all.

There is one other kind of layer that appears in the original Keras model: a dropout layer. This doesn’t appear in our code because it is only used while training. It discards a certain proportion of the values passed to it as input, something that can help to make the following layer more robust to error when trained.

## Further notes

### Precision and repeatability

Training a network is a process that involves randomness. A given model architecture can produce quite different results on separate training runs.

This is not the case when running a trained network to make a classification or prediction. Two implementations of a network that both use the same basic arithmetic datatypes (e.g. 32-bit floats) should produce the same results when given the same set of trained weights and biases. If you are trying to reproduce a network’s behaviour using a different library or framework, and you have the same trained weights to use with both implementations, and your results look only “sort of” similar, then they’re probably wrong.

The differences between implementations in terms of channel ordering and indexing can be problematic in this respect. Consider the convolution function with its seven nested for-loops. Load any of those weights with the wrong index or in the wrong order and of course it won’t work, and there are a lot of possible permutations there. There seems to be more than one opinion about how kernel weights should be indexed, for example, and it’s easy to get them upside-down if you’re converting from one framework to another.

### Should I use code like this in production?

Probably not!

First and most trivially, this isn’t a very efficient arrangement of layers. Separating out the zero-padding and activation functions means more memory allocation and copying.

Second, there are many ways to speed up the expensive brute-force layers (that is, the convolution and dense layers) dramatically even on the CPU, using vectorisation, careful cache and memory management, and faster matrix and convolution algorithms. If you don’t take advantage of these, you’re wasting time for no purpose; if you do, you’re repeating work other people are doing in libraries already.

I adapted the example network to use the tiny-dnn header-only C++ library, and it ran roughly twice as fast as the code above. (I’m a sucker for libraries named tiny-something, even if, as in this case, they are no longer all that tiny.) That version of the example code can be found in the with-tiny-dnn directory in the repo. It’s ugly, because I had to load and convert the weights from a system that uses channels-last format into one that expects channels-first. But I only had to write that code once, and it wouldn’t have been necessary if I’d trained the model in a channels-first layout. I might also have overlooked some simpler way to do it.

Third, this approach is fragile in the face of changes to the model architecture, which have to be duplicated exactly across the model implementation and the separately-loaded weights. It would be preferable to load the model architecture into your program, rather than load the weights into a hand-written architecture. To this end it seems to make sense to use a library that supports importing and exporting models automatically.

All this said, I find it highly liberating to realise that one could just sketch out these few lines of code and have a working result.

# Notes on Idris

Idris Book cover

In a bid to expand my programming brain by learning something about “dependent types”, I recently bought the Idris book.

(Idris is a pure functional programming language that is mostly known for supporting dependent types. Not knowing what that really meant, and seeing that this recently-published book written by the author of the language was warmly reviewed on Amazon, I saw an opportunity.)

The idea behind dependent typing is to allow types to be declared as having dependencies on information that would, in most languages, only be known at runtime. For example, a function might accept two arrays as arguments, but only work correctly if the two arrays that are actually passed to it have the same length: with dependent types, this dependency can be written into the type declaration and checked at compile time. On the face of it we can’t in general know things like the number of elements in an arbitrary array at compile time, so this seems like compelling magic. Types are, in fact, first class values, and if that raises a lot of horribly recursive-seeming questions in your mind, then this book might help.

A number of academic languages are exploring this area, but Idris is interesting because of its ambition to be useful for general-purpose programming. The book is pleasingly straightforward about this: it adopts an attitude that “this is how we program, and I’ll help you do it”. In fact, the Idris book might be the most inspiring programming book I’ve read since ML for the Working Programmer (although the two are very different, more so than their synopses would suggest). It’s not that the ideas in it are new — although they are, to me — but that it presents them so crisply as to project a tantalising image of a better practice of programming.

### What the book suggests

The principles in the book — as I read it — are:

• We always write down the type of a function before writing the function. Because the type system is so expressive, this gives the compiler, and the programmer, an awful lot of information — sometimes so much that there remains only one obvious way to satisfy the type requirements when writing the rest of the function. (The parallel with test-driven development, which can be a useful thinking aid when you aren’t sure how to address a problem, presumably inspired the title of the book.)
• Sometimes, we don’t even implement the rest of the function yet. Idris supports a lovely feature called “holes”, which are just names you plonk down in place of code you haven’t got around to writing yet. The compiler will happily type-check the rest of the program as if the holes were really there, and then report the holes and their types to you to give you a checklist of the bits you still need to fill in.
• Functions are pure, as in they simply convert their input values to their return values and don’t modify any other state as they go along, and are if possible total, meaning that every input has a corresponding return value and the function won’t bail out or enter an infinite loop. The compiler will actually check that your functions are total — the Halting Problem shows that this can’t be done in general, but apparently it can be done enough of the time to be useful. The prospect of having a compiler confirm that your program cannot crash is exciting even to someone used to Standard ML, where typing is generally sound but inexhaustive cases and range errors still happen.
• Because functions are pure, all input and output uses monads. That means that I/O effects are encapsulated into function return types. The book gives a supremely matter-of-fact introduction to monadic I/O, without using the word monad.
• We use an IDE which supports all this with handy shortcuts — there is an impressive Idris integration for Atom which is used extensively throughout the book. For an Emacs user like me this is slightly offputting. There is an Emacs Idris mode as well, it’s just that so much of the book is expressed in terms of keystrokes in Atom.

None of these principles refers to dependent types at all, but the strength of the type system is what makes it plausible that this could all work. Idris does a lot of evaluation at compile time, to support the type system and totality checking, so you always have the feeling of having run your program a few times before it has even compiled.

So, I enjoyed the book. To return to earth-based languages for a moment, it reminds me a bit of Bjarne Stroustrup’s “A Tour of C++”, which successfully sets out “modern C++” as if most of the awkward history of C++ had never happened. One could see this book as setting out a “modern Haskell” as if the actual Haskell had not happened. Idris is quite closely based on Haskell, and I think, as someone who has never been a Haskell programmer, that much of the syntax and prelude library are the same. The biggest difference is that Idris uses eager evaluation, where Haskell is lazily evaluated. (See 10 things Idris improved over Haskell for what appears to be an informed point of view.)

### Then what happened?

How did I actually get on with it? After absorbing some of this, I set out to write my usual test case (see Four MLs and a Python) of reading a CSV file of numbers and adding them up. The goal is to read a text file, split each line into number columns, sum the columns across all lines, then print out a single line of comma-separated sums — and to do it in a streaming fashion without reading the whole file into memory.

Here’s my first cut at it. Note that this doesn’t actually use dependent types at all.

module Main

import Data.String

parseFields : List String -> Maybe (List Double)
parseFields strs =
foldr (\str, acc => case (parseDouble str, acc) of
(Just d, Just ds) => Just (d :: ds)
_ => Nothing)
(Just []) strs

parseLine : String -> Maybe (List Double)
parseLine str =
parseFields $Strings.split (== ',') str sumFromFile : List Double -> File -> IO (Either String (List Double)) sumFromFile xs f = do False <- fEOF f | True => pure (Right xs) Right line <- fGetLine f | Left err => pure (Left "Failed to read line from file") if line == "" then sumFromFile xs f else case (xs, parseLine line) of ([], Just xs2) => sumFromFile xs2 f (xs1, Just xs2) => if length xs1 == length xs2 then sumFromFile (zipWith (+) xs1 xs2) f else pure (Left "Inconsistent-length rows") (_, Nothing) => pure (Left$ "Failed to parse line: " ++ line)

sumFromFileName : String -> IO (Either String (List Double))
sumFromFileName filename =
do Right f <- openFile filename Read
| Left err => pure (Left "Failed to open file")
sumFromFile [] f

main : IO ()
main =
do [_, filename] <- getArgs
| _ => putStrLn "Exactly 1 filename must be given"
Right result <- sumFromFileName filename
| Left err => putStrLn err
putStrLn (pack $intercalate [',']$ map (unpack . show) $result)  Not beautiful, and I had some problems getting this far. The first thing is that compile times are very, very slow. As soon as any I/O got involved, a simple program took 25 seconds or more to build. Surprisingly, almost all of the time was used by gcc, compiling the C output of the Idris compiler. The Idris type checker itself was fast enough that this at least didn’t affect the editing cycle too much. The combination of monadic I/O and eager evaluation was also a problem for me. My test program needs to process lines from a file one at a time, without reading the whole file first. In an impure language like SML, this is easy: you can read a line in any function, without involving any I/O in the type signature of the function. In a pure language, you can only read in an I/O context. In a pure lazy language, you can read once in an I/O context and get back a lazily-evaluated list of all the lines in a file (Haskell has a prelude function called lines that does this), which makes toy examples like this simple without having to engage properly with your monad. In a pure eager language, it isn’t so simple: you have to actually “do the I/O” and permeate the I/O context through the program. This conceptual difficulty was compounded by the book’s caginess about how to read from a file. It’s full of examples that read from the terminal, but the word “file” isn’t in the index. I have the impression this might be because the “right way” is to use a higher-level stream interface, but that interface maybe wasn’t settled enough to go into a textbook. (As a non-Haskell programmer I find it hard to warm to the “do” syntax used as sugar for monadic I/O in Haskell and Idris. It is extremely ingenious, especially the way it allows alternation — see the lines starting with a pipe character in the above listing — as shorthand for handling error cases. But it troubles me, reminding me of working with languages that layered procedural sugar over Lisp, like the RLisp that the REDUCE algebra system was largely written in. Robert Harper once described Haskell as “the world’s best imperative programming language” and I can see where that comes from: if you spend all your time in I/O context, you don’t feel like you’re writing functional code at all.) Anyway, let’s follow up my clumsy first attempt with a clumsy second attempt. This one makes use of what seems to be the “hello, world” of dependent typing, namely vectors whose length is part of their compile type. With this, I guess we can guarantee that the lowest-level functions, such as summing two vectors, are only called with the proper length arguments. That doesn’t achieve much for our program since we’re dealing directly with arbitrary-length user inputs anyway, but I can see it could be useful to bubble up error handling out of library code. Here’s that second effort: module Main import Data.Vect import Data.String total sumVects : Vect len Double -> Vect len Double -> Vect len Double sumVects v1 v2 = zipWith (+) v1 v2 total parseFields : List String -> Maybe (List Double) parseFields strs = foldr (\str, acc => case (parseDouble str, acc) of (Just d, Just ds) => Just (d :: ds) _ => Nothing) (Just []) strs total parseVect : String -> Maybe (len ** Vect len Double) parseVect str = case parseFields$ Strings.split (== ',') str of
Nothing => Nothing
Just xs => Just (_ ** fromList xs)

sumFromFile : Maybe (len ** Vect len Double) -> File ->
IO (Either String (len ** Vect len Double))
sumFromFile acc f =
do False <- fEOF f
| True => case acc of
Nothing => pure (Right (_ ** []))
Just v => pure (Right v)
Right line <- fGetLine f
| Left err => pure (Left "Failed to read line from file")
if line == ""
then sumFromFile acc f
else case (acc, parseVect line) of
(_, Nothing) => pure (Left $"Failed to parse line: " ++ line) (Nothing, other) => sumFromFile other f (Just (len ** xs), Just (len' ** xs')) => case exactLength len xs' of Nothing => pure (Left "Inconsistent-length rows") Just xs' => sumFromFile (Just (len ** (sumVects xs xs'))) f sumFromFileName : String -> IO (Either String (len ** Vect len Double)) sumFromFileName filename = do Right f <- openFile filename Read | Left err => pure (Left "Failed to open file") sumFromFile Nothing f main : IO () main = do [_, filename] <- getArgs | _ => putStrLn "Exactly 1 filename must be given" Right (_ ** result) <- sumFromFileName filename | Left err => putStrLn err putStrLn (pack$ intercalate [','] $map (unpack . show)$ toList result)


Horrible. This is complicated and laborious, at least twice as long as it should be, and makes me feel as if I’ve missed something essential about the nature of the problem. You can see that, as a programmer, I’m struggling a surprising amount here.

Both of these examples compile, run, and get the right answers. But when I tried them on a big input file, both versions took more than 8 minutes to process it — on a file that was processed in less than 30 seconds by each of the languages I tried out in for this post. A sampling profiler did not pick out any obvious single source of delay. Something strange is going on here.

All in all, not a wild success. That’s a pity, because:

### Good Things

My failure to produce a nice program that worked efficiently didn’t entirely burst the bubble. I liked a lot in my first attempt to use Idris.

The language has a gloriously clean syntax. I know my examples show it in a poor light; compare instead this astonishingly simple typesafe implementation of a printf-like string formatting function, where format arguments are typechecked based on the content of the format string, something you can’t do at all in most languages. Idris tidies up a number of things from Haskell, and has a beautifully regular syntax for naming types.

Idris is also nice and straightforward, compared with other strongly statically-typed languages, when it comes to things like using arithmetic operators for different kinds of number, or mechanisms like “map” over different types. I assume this is down to the pervasive use of what in Haskell are known as type classes (here called interfaces).

I didn’t run into any problems using the compiler and tools. They were easy to obtain and run, and the compiler (version 1.1.1) gives very clear error messages considering how sophisticated some of the checks it does are. The compiler generally “feels” solid.

I’m not sold on the fact that these languages are whitespace-sensitive — life’s too short to indent code by hand — though this does contribute to the tidy syntax. Idris seems almost pathologically sensitive to indentation level, and a single space too many when indenting a “do” block can leave the compiler lost. But this turned out to be less of a problem in practice than I had expected.

Will I be using Idris for anything bigger? Not yet, but I’d like to get better at it. I’m going to keep an eye on it and drop in occasionally to try another introductory problem or two. Meanwhile if you’d like to improve on anything I’ve written here, please do post below.

# MIREX 2017 submissions

For the fifth year in a row, this year the Centre for Digital Music submitted a number of Vamp audio analysis plugins to the MIREX evaluation for “music information retrieval” tasks. This year we submitted the same set of plugins as last year; there were no new implementations, and some of the existing ones are so old as to have celebrated their tenth birthday earlier in the year. So the goal is not to provide state-of-the-art results, but to give other methods a stable baseline for comparison and to check each year’s evaluation metrics and datasets against neighbouring years. I’ve written about this in each of the four previous years: see posts about 2016, 2015, 2014, and 2013.

Obviously, having submitted exactly the same plugins as last year, we expect basically the same results. But the other entries we’re up against will have changed, so here’s a review of how each category went.

(Note: we dropped one category this year, Audio Downbeat Estimation. Last year’s submission was not well prepared for reasons I touched on in last year’s post, and I didn’t find time to rework it.)

### Structural Segmentation

Results for the four datasets are here, here, here, and here. Our results, for Segmentino from Matthias Mauch and the older QM Segmenter from Mark Levy, were the same as last year, with the caveat that the QM Segmenter uses random initialisation and so never gets exactly the same results twice.

Surprisingly, nobody else entered anything to this category this year, which seems a pity because it’s an interesting problem. This category seems to have peaked around 2012-2013.

### Multiple Fundamental Frequency Estimation and Tracking

An exciting year for this mind-bogglingly difficult category, with 14 entries from ten different sets of authors and a straight fight between template decomposition methods (including our Silvet plugin, from Emmanouil Benetos’s work) and trendy convolutional neural networks. Results are here and here.

With so many entries and evaluations it’s not that easy to get a clear picture, and no single method appears to be overwhelmingly strong. There were fine results in some evaluations for CNN methods from Thickstun et al and Thomé and Ahlbäck, for Pogorelyuk and Rowley‘s very intriguing “Dynamic Mode Decomposition”, and for a few others whose abstracts are missing from the entry site and so can’t be linked to.

Silvet, with the same results as last year, does well enough to be interesting, but in most cases it isn’t troubling the best of the newer methods.

### Audio Onset Detection

Bit of a puzzle here, as our two plugin submissions both got slightly different results from last year despite being unchanged implementations of deterministic methods invoked in the same way on the same data sets.

Last year saw a big expansion in the number of entries, and this year there were nearly as many. Just as last year, our old plugins did modestly, but again some of the new experiments fared a bit less well so we weren’t quite at the bottom. Results here.

### Audio Beat Tracking

Same puzzle as in onset detection: while our results were basically similar to last year, they weren’t identical. The 2015 and 2016 results were identical and we would have expected the same again in 2017.

That apart, there’s little to report since last year. Results are here, here, and here.

### Audio Tempo Estimation

Last year there were two entries in this category, ours and a much stronger one from Sebastian Böck. This year sees one addition, from Hendrick Schreiber and Meinard Müller, which fares creditably. The results are here.

### Audio Key Detection

Two pretty successful new submissions this year, both using convolutional neural networks: one from Korzeniowski, Böck, Krebs and Widmer, and the other from Hendrik Schreiber. Our old plugin (from work by Katy Noland) does not fare tragically, but it’s clear that some other methods are getting much closer to the sort of performance one imagines should be realistic. The results are linked from here.

Intuitively, key estimation seems like the sort of problem that is interesting only so long as you don’t have enough training data. As a 24-way classification with large enough training datasets, it looks a bit mundane. The problem becomes, what does it mean for a piece of music to be in a particular key anyway? Submissions are not expected to answer that, but presumably it sets an upper bound on performance.

### Audio Chord Estimation

Another increase in the number of test datasets, from 5 to 7, and a strong category again. Last year our submission Chordino (by Matthias Mauch) was beginning to trail, though it wasn’t quite at the back. This year some of the weaker submissions have not been repeated, some new entries have appeared, and Chordino is in last place for every evaluation. It’s not far behind — perceptually it’s still a pretty good algorithm — but some of the other methods are very impressive now. Here are the results.

The abstracts accompanying the two submissions from the audio information processing group at Fudan University in Shanghai (Jiang, Li and Wu and Wu, Feng and Li) are both well worth a read. The former paper refers closely to Chordino, using the same NNLS Chroma features with a new front-end. Meanwhile, the latter paper proposes a method worth remembering for dinner parties, using deep residual networks trained from MIDI-synchronised constant-Q representations of audio with a bidirectional long-short-term memory and conditional random field for labelling.

# Notes from the Audio Developer Conference

I’ve spent the last couple of days at the 2017 Audio Developer Conference organised by ROLI. This is a get-together and technical conference for people who work on audio software and software-driven-hardware, in practice mostly people working on music applications.

I don’t go to many conferences these days, despite working in academia. I don’t co-write many papers and I’m no longer funded by a project with a conference budget. I’ve been to a couple that we hosted ourselves at the Centre for Digital Music, but I think the last one I went to anywhere else was the 2014 Linux Audio Conference in Karlsruhe. I don’t mind this situation (I don’t like to travel away from my family anyway), I just mention it to give context for why a long-time academic employee like me should bother to write up a conference at all!

Here are my notes — on things I liked and things I didn’t — in roughly chronological order.

The venue is interesting, quite fancy, and completely new to me. (It is called CodeNode.) I’m a bit boggled that there is such a big space right in the middle of the City given over to developer events. I probably shouldn’t be boggling at that any more, but I can’t help it.
Nice furniture too.

The attendees are amazingly homogeneous. I probably wouldn’t have even noticed this, back when I was tangentially involved in the commercial audio development world, as I was part of the homogeneity. But our research group is a fair bit more diverse and I’m a bit more perceptive now. From the attendance of this event, you would conclude that 98% of audio developers are male and 90% are white people from northern Europe.
When I have been involved in organising events in academia, we have found it hard to get a speaker lineup that is as diverse as the population of potential attendees (i.e. the classic all-male panel problem). I have failed badly at this, even when trying hard — I am definitely part of the problem when it comes to conference organisation. Here, though, my perception is the other way around: the speakers are a closer reflection of what I perceive as the actual population than the attendees are.

Talks I went to:

Day 2 (i.e. the first day of the talks):

• The future is wide: SIMD, vector classes and branchless algorithms for audio synthesis by Angus Hewlett of FXpansion (now employed by ROLI). A topic I’m interested in and he has clearly done solid work on (see here), but it quickly reached the realms of tweaks I personally am probably never going to need. The most heartening lesson I learned was that compilers are getting better and better at auto-vectorisation.
• Exploring time-frequency space with the Gaborator by Andreas Gustafsson. I loved this. It was about computing short-time constant-Q transforms of music audio and presenting the results in an interactive way. This is well-trodden territory: I have worked on more than one implementation of a constant-Q transform myself, and on visualising the results. But I really appreciated his dedication to optimising the transform (which appears to be quicker and more invertible than my best implementation) and his imagination in rendering it (reusing the Leaflet mapping API to display time-frequency “maps”). There is a demo of this here and I like it a lot.
So I was sitting there thinking “yes! nice work!”, but when it came to the questions, it was apparent that people didn’t really get how nice it was. I wanted to pretend to ask a question, just in order to say “I like it!”. But I didn’t, and then I never managed to work up to introducing myself to Andreas afterwards. I feel bad and I wish I had.
• The development of Ableton Live by Friedemann Schautz. This talk could only disappoint, after its title. But I had to attend anyway. It was a broad review of observations from the development of Live 10, and although I didn’t learn much, I did like Friedemann and thought I would happily work on a team he was running.
• The amazing usefulness of band-limited impulse trains by Stefan Stenzel of Waldorf. This was a nice old-school piece. Who can resist an impulse train? Not I.
• Some interesting phenomena in nonlinear oscillators by André Bergner of Native Instruments. André is a compelling speaker who uses hand-drawn slides (I approve) and this was a neat mathematical talk, though I wasn’t able to stay to the end of it.

Day 3 (second and final day of talks):

• The human in the musical loop (keynote) by Elaine Chew. Elaine is a professor in my group and I know some of her work quite well, but her keynote was exactly what I needed at this time, first thing in the morning on the second day. After a day of bits-driven talks, this was a piece about performers and listeners from someone who is technologically adept herself, and curious, but talks about music first. Elaine is also very calm, which was useful when the projector hardware gave up during her talk and stopped working for a good few minutes. I think as a result she had to hurry the closing topic (about the heartbeat project) which was a pity, as it could have been fascinating to have expanded on this a bit more.
Some of what Elaine talked about was more than a decade old, and I think this is one of the purposes of professors: to recall, and to be able to communicate, relevant stuff that happened longer ago than any current research student remembers.
• The new C++17, and why it is good for you by Timur Doumler. The polar opposite of Elaine’s talk, but I was now well-cushioned for it. C++17 continues down the road of simplifying the “modern-language” capabilities C++ has been acquiring since C++11. Most memorable for me are destructuring bind, guaranteed copy elision on value return, variant types, and filesystem support in the standard library.
Destructuring bind is interesting and I’ve written about it separately.
• The use of std::variant in realtime DSP by Ian Hobson. A 50-minute slot, for a talk about which Timur Doumler’s earlier talk had already given away the twist! (Yes you can use std::variant, it doesn’t do any heap allocation.) Ambitious. This was a most satisfying talk anyway, as it was all about performance measurements and other very concrete stuff. No mention of the Expression Problem though.
• Reactive Extensions (Rx) in JUCE by Martin Finke. I have never used either React or JUCE so I thought this would be perfect for me. I had a question lined up: “What is JUCE?” but I didn’t dare use it. The talk was perfectly comprehensible and quite enlightening though, so my silly bit of attitude was quite misplaced. I may even end up using some of what I learned in it.

# C++17 destructuring bind

I know very little about C++17, but I attended a talk about it this morning and the syntax for destructuring bind caught my attention.

This is a feature widely supported in other languages, where you assign a complex type to another complex declaration with individual names in it that match the original type, and you can then refer individually to the values that were assigned.

Python:

>>> [a,b,c] = [1,2,3]
>>> a
1
>>> b
2
>>> c
3
>>>

Standard ML (is that the Ur-language here?):

> val (a, b) = (1, 2);
val a = 1: int
val b = 2: int
> val { this, that } = { that = 1, this = 2 };
val that = 1: int
val this = 2: int

In C++17 the target syntax uses square brackets:

int a[2] = {1,2};
auto [x,y] = a;

It works regardless of whether the source is a structure, array, tuple, pair, etc.

What is interesting about it, in C++, is that it appears the source structure is always indexed by declaration order, rather than by name. That is, if you write

struct { int a; int b; } x { 1, 2 };
auto [b, a] = x;

then b will be 1 and a will be 2, even though in the original structure b was 2 and a was 1.

In other languages, a destructuring bind of a structure with named fields is performed using matching by name rather than by index. (See the SML example above.)

This highlights something that has been building for a while about C++. Since C++11 introduced the structure initialisation syntax in which you just list structure values one by one, we have increasingly accepted referring to structure elements by declaration order rather than by name. Someone who swapped two structure elements of the same type could break an awful lot of code without the compiler noticing. This doesn’t feel very safe, for a supposedly type-safe (ish) language.

And it isn’t the way other languages work. I don’t know any other language that will happily destructure a named structure by index rather than name, or even construct one (even the structure initialisers in C, since C99, are safer).

I’d love to know whether this has affected many people in practice, or whether I’m just being a bit suspicious.