Repoint: A manager for checkouts of third-party source code dependencies

I’ve just tagged v1.0 of Repoint, a tool for managing library source code in a development project. Conceptually it sits somewhere between Mercurial/Git submodules and a package manager like npm. It is intended for use with languages or environments that don’t have a favoured package manager, or in situations where the dependent libraries themselves aren’t aware that they are being package-managed. Essentially, situations where you want, or need, to be a bit hands-off from any actual package manager. I use it for projects in C++ and SML among other things.

Like npm, Bundler, Composer etc., Repoint refers to a project spec file that you provide that lists the libraries you want to bring in to your project directory (and which are brought in to the project directory, not installed to a central location). Like them, it creates a lock file to record the versions that were actually installed, which you can commit for repeatable builds. But unlike npm et al, all Repoint actually does is clone from the libraries’ upstream repository URLs into a subdirectory of the project directory, just as happens with submodules, and then report accurately on their status compared with their upstream repositories later

The expected deployment of Repoint consists of copying the Repoint files into the project directory, committing them along with everything else, and running Repoint from there, in the manner of a configure script — so that developers generally don’t have to install it either. It’s portable and it works the same on Linux, macOS, or Windows. Things are not always quite that simple, but most of the time they’re close.

At its simplest, Repoint just checks stuff out from Git or whatever for you, which doesn’t look very exciting. An example on Windows:

repoint

Simple though Repoint’s basic usage is, it can run things pretty rigorously across its three supported version-control systems (git, hg, svn), it gets a lot of annoying corner cases right, and it is solid, reliable, and well-tested across platforms. The README has more documentation, including of some more advanced features.

Is this of any use to me?

Repoint might be relevant to your project if all of the following apply:

  • You are developing with a programming language or environment that has no obvious single answer to the “what package manager should I use?” question; and
  • Your code project depends on one or more external libraries that are published in source form through public version-control URLs; and
  • You can’t assume that a person compiling your code has those libraries installed already; and
  • You don’t want to copy the libraries into your own version-control repo to form a Giant Monorepo; and
  • Most of your dependent libraries do not similarly depend on other libraries (Repoint doesn’t support recursive dependencies at all).

Beyond mere relevance, Repoint might be actively useful to your project if any of the following also apply:

  • The libraries you’re using are published through a mixture of version-control systems, e.g. some use Git but others Mercurial or Subversion; or
  • The libraries you’re using and, possibly, your own project might change from one version-control system to another at some point in the future.

See the README for more caveats and general documentation.

Example

The biggest current example of a project using Repoint is Sonic Visualiser. If you check out its code from Github or from the SoundSoftware code site and run its configure script, it will call out to repoint install to get the necessary dependencies. (On platforms that don’t use the configure script, you have to run Repoint yourself.)

Note that if you download a Sonic Visualiser source code tarball, there is no reference to Repoint in it and the Repoint script is never run — Repoint is very much an active-developer tool, and it includes an archive function that bundles up all the dependent libraries into a tarball so that people building or deploying the end result aren’t burdened with any additional utilities to use.

I also use Repoint in various smaller projects. If you’re browsing around looking at them, note that it wasn’t originally called Repoint — its working title in earlier versions was vext and I haven’t quite finished switching the repos over. Those earlier versions work fine of course, they just use different words.

 

What does a convolutional neural net actually do when you run it?

Convolutional neural networks (or convnets or CNNs) are a staple of “deep learning”. There are many tutorials available that describe what they do, either mathematically or via quasi-mystical appeals to intuition, and introduce how to train and use them, often with image classification examples.

This post has a narrower focus. As a programmer, I am happy processing floating-point data in languages like C++, but I’m more likely to be building applications with pre-trained models than training new ones. Say I have a pre-trained convnet image classifier, and I use it to carry out a single classification of one image (“tell me whether this shows a pig, cow, or sheep”) — what does that actually mean, in terms of code?

I’ll go through an example in the form of a small hand-written C++ convolutional net that contains only the “execution” code, no training logic. It will be very much an illustrative implementation, not production code. It will use pre-trained data adapted from a similar model in Keras, on which more later.

The specific example I am using is a typical image classification one: identifying which of five types of flower is visible in a small colour image (inspired by this tutorial). All the code I’m describing can be found in this Github repository.

Networks, layers, weights, training

A network in this context is a pipeline of functions through which data is passed, with the output of each function going directly to the input of the next one. Each of these functions is known as a layer.

Our example will begin by supplying the image data as input to the first layer, and end with the final layer returning a set of five numbers, one for each kind of flower the network knows about, indicating how likely the network thinks it is that the image shows that kind of flower.

So to run one classification, we just have to get the image data into the right format and pass it through each layer function in turn, and out pops the network’s best guess.

The layers themselves perform some mathematical operation on the data. Some of them do so by making use of a set of other values, provided separately, which they might multiply with the inputs in some way (in which case they are known as weights) or add to the return values (known as biases).

Coming up with suitable weight and bias values for these layers is what training is for. To train a network, the weights and biases for each such layer are randomly initialised, then the network is repeatedly run to provide classifications of known inputs; each of the network’s guesses is compared with the known class of the input, and the weights and biases of the network are updated depending on how accurate the classification was. Eventually, with luck, they will converge on some reliable values. Training is both difficult and tedious, so here we will happily leave it to machine-learning researchers.

These trainable layers are then typically sandwiched by fixed layers that provide non-linearities (also known as activation functions) and various data-manipulation bits and bobs like max-pooling. More on these later.

The layers are pure functions: a layer with a given set of weights and biases will always produce the same output for a given input. This is not always true within recurrent networks, which model time-varying data by maintaining state in the layers, but it is true of classic convolutional nets.

Our example network

arch3

This small network consists of four rounds of convolution + max-pooling layers, then two dense layers.

I constructed and trained this using Keras (in Python) based on a “flower pictures” dataset downloaded from the Tensorflow site, with 5 labelled classes of flowers. My Github repository contains a script obtain-data.sh which downloads and prepares the images, and a Python program with-keras/train.py which trains this network on them and exports the trained weights into a C++ file. The network isn’t a terribly good classifier, but that doesn’t bother me here.

I then re-implemented the model in C++ without using a machine-learning library. Here’s what the pipeline looks like. This can be found in the file flower.cpp in the repository. All of the functions called here are layer functions defined elsewhere in the same file, which I’ll go through in a moment. The variables with names beginning weights_ or biases_ are the trained values exported from Keras. Their definitions can be found in weights.cpp.

vector<float>
classify(const vector<vector<vector<float>>> &image)
{
    vector<vector<vector<float>>> t;

    t = zeroPad(image, 1, 1);
    t = convolve(t, weights_firstConv, biases_firstConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_secondConv, biases_secondConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_thirdConv, biases_thirdConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    t = zeroPad(t, 1, 1);
    t = convolve(t, weights_fourthConv, biases_fourthConv);
    t = activation(t, "relu");
    t = maxPool(t, 2, 2);

    vector<float> flat = flatten(t);
    
    flat = dense(flat, weights_firstDense, biases_firstDense);
    flat = activation(flat, "relu");

    flat = dense(flat, weights_labeller, biases_labeller);
    flat = activation(flat, "softmax");

    return flat;
}

Many functions are used for more than one layer — for example we have a single convolve function used for four different layers, with different weights, bias values, and input shapes each time. This reuse is possible because layer functions don’t retain any state from one call to the next.

You can see that we pass the original image as input to the first layer, but subsequently we just take the return value from each function and pass it to the next one. These values have types that we are expressing as nested vectors of floats. They are all varieties of tensor, on which let me digress for a moment:

Tensors

For our purposes a tensor is a multidimensional array of numbers, in our case floats, of which vectors and matrices are special cases. A tensor has a shape, which we can write as a list of sizes, and the length of that list is the rank. (The word dimension can be ambiguous here and I’m going to try to avoid it… apart from that one time I used it a moment ago…)

Examples:

  • A matrix is a rank-2 tensor whose shape has two values, height and width. The matrix \begin{bmatrix}1.0&3.0&5.0\\2.0&4.0&6.0\end{bmatrix} has shape [2, 3], if we are giving the height first.
  • A C++ vector of floats is a rank-1 tensor, whose shape is a list of one value, the number of elements in the vector. The vector { 1.0, 2.0, 3.0 } has the shape [3].
  • A single number can also be considered a tensor, of rank 0, with shape [].

All of the values passed to and returned from our network layer functions are tensors of various shapes. For example, a colour image will be represented as a tensor of rank 3, as it has height, width, and a number of colour channels.

In this code, we are storing tensors using C++ vectors (for rank 1), vectors of vectors (for rank 2), and so on. This is transparent, makes indexing easy, and allows us to see the rank of each tensor directly from its type.

Production C++ frameworks don’t do it this way — they typically store tensors as values interleaved into a single big array, wrapped up in a tensor class that knows how to index into it. With such a design, unlike our code, all tensors will have the same C++ class type regardless of their rank.

There is a practical issue with the ordering of tensor indices. Knowing the shape of a tensor is not enough: we also have to know which index is which. For example, an RGB image that is 128 pixels square has shape [128, 128, 3] if we index the height, width, and colour channels in that order, or [3, 128, 128] if we separate out the individual colour channels and index them first. Unfortunately, both layouts are in common use. Keras exposes the choice through a flag and uses the names channels_last for the former layout, historically the default in Tensorflow as it often runs faster on CPUs, and channels_first for the latter, used by many systems as it parallelises better with GPUs. The code in this example uses channels_last ordering.

This definition of a tensor is good enough for us, but in mathematics a tensor is not just an array of numbers but an object that sits in an algebraic space and can be manipulated in ways intrinsic to that space. Our data structures make no attempt to capture this, just as a C++ vector makes no effort to capture the properties of a mathematical vector space.

Layers used in our model

Each layer function takes a tensor as input, which I refer to within the function as in. Trained layers also take further tensors containing the weights and biases for the layer.

Convolution layer

Here’s the nub of this one, implemented in the convolve function.

for (size_t k = 0; k < nkernels; ++k) {
    for (size_t y = 0; y < out_height; ++y) {
        for (size_t x = 0; x < out_width; ++x) {
            for (size_t c = 0; c < depth; ++c) {
                for (size_t ky = 0; ky < kernel_height; ++ky) {
                    for (size_t kx = 0; kx < kernel_width; ++kx) {
                        out[y][x][k] +=
                            weights[ky][kx][c][k] *
                                in[y + ky][x + kx][c];
                    }
                }
            }
            out[y][x][k] += biases[k];
        }
    }
}

Seven nested for-loops! That’s a lot of brute force.

The weights for this layer, supplied in a rank-4 tensor, define a set of convolution kernels (or filters). Each kernel is a rank-3 tensor, width-height-depth. For each pixel in the input, and for each channel in the colour depth, the kernel values for that channel are multiplied by the surrounding pixel values, depending on the width and height of the kernel, and summed into a single value which appears at the same location in the output. We therefore get an output tensor containing a matrix of (almost) the same size as the input image, for each of the kernels.

(Note for audio programmers: FFT-based fast convolution is not generally used here, as it works out slower for small kernels)

Vaguely speaking, this has the effect of transforming the input into a space determined by how much each pixel’s surroundings have in common with each of the learned kernel patterns, which presumably capture things like whether pixels have a common colour or whether there is common activity between a pixel and neighbouring pixels in a particular direction due to edges or lines in the image.

Usually the early convolution layers in an image classifier will explode the amount of data being passed through the network. In our small model, the first convolution layer takes in a 128×128 image with 3 colour channels, and returns a 128×128 output for each of 32 learned kernels: nearly ten times as much data, which will then be reduced gradually through later layers. But fewer weight values are needed to describe a convolution layer compared with the dense layers we’ll see later in our network, so the size of the trained layer (what you might call the “code size”) is relatively smaller.

I said above that the output matrices were “almost” the same size as the input. They have to be a bit smaller, as otherwise the kernel would overrun the edges. To allow for this, we precede the convolution layer with a…

Zero-padding layer

This can be found in the zeroPad function. It creates a new tensor filled with zeros and copies the input into it:

for (size_t y = 0; y < in_height; ++y) {
    for (size_t x = 0; x < in_width; ++x) {
        for (size_t c = 0; c < depth; ++c) {
            out[y + pad_y][x + pad_x][c] = in[y][x][c];
        }
    }
}

All this does is add a zero-valued border around all four edges of the image, in each channel, in order to make the image sufficiently bigger to allow for the extent of the convolution kernel.

Most real-world implementations of convolution layers have an option to pad the input when they do the convolution, avoiding the need for an explicit zero-padding step. In Keras for example the convolution layers have a padding parameter that can be either valid (no extra padding) or same (provide padding so the output and input sizes match). But I’m trying to keep the individual layer functions minimal, so I’m happy to separate this out.

Activation layer

Again, this is something often carried out in the trained layers themselves, but I’ve kept it separate to simplify those.

This introduces a non-linearity by applying some simple mathematical function to each value (independently) in the tensor passed to it.

Historically, for networks without convolution layers, the activation function has usually been some kind of sigmoid function mapping a linear input onto an S-shaped curve, retaining small differences in a critical area of the input range and flattening them out elsewhere.

The activation function following a convolution layer is more likely to be a simple rectifier. This gives us the most disappointing acronym in all of machine learning: ReLU, which stands for “rectified linear unit” and means nothing more exciting than

if (x < 0.0) {
    x = 0.0;
}

applied to each value in the tensor.

We have a different activation function right at the end of the network: softmax. This is a normalising function used to produce final classification estimates that resemble probabilities summing to 1:

float sum = 0.f;
for (size_t i = 0; i < sz; ++i) {
    out[i] = exp(out[i]);
    sum += out[i];
}
if (sum != 0.f) {
    for (size_t i = 0; i < sz; ++i) {
        out[i] /= sum;
    }
}

Max pooling layer

The initial convolution layer increases the amount of data in the network, and subsequent layers then reduce this into a smaller number of hopefully more meaningful higher-level values. Max pooling is one way this reduction happens. It reduces the resolution of its input in the height and width axes, by taking only the maximum of each NxM block of pixels, for some N and M which in our network are both always 2.

From the function maxPool:

for (size_t y = 0; y < out_height; ++y) {
    for (size_t x = 0; x < out_width; ++x) {
        for (size_t i = 0; i < pool_y; ++i) {
            for (size_t j = 0; j < pool_x; ++j) {
                for (size_t c = 0; c < depth; ++c) {
                    float value = in[y * pool_y + i][x * pool_x + j][c];
                    out[y][x][c] = max(out[y][x][c], value);
                }
            }
        }
    }
}

Flatten layer

Interleaves a rank-3 tensor into a rank-1 tensor, otherwise known as a single vector.

for (size_t y = 0; y < height; ++y) {
    for (size_t x = 0; x < width; ++x) {
        for (size_t c = 0; c < depth; ++c) {
            out[i++] = in[y][x][c];
        }
    }
}

Why? Because that’s what the following dense layer expects as its input. By this point we are saying that we have used as much of the structure in the input as we’re going to use, to produce a set of values that we now treat as individual features rather than as a structure.

With a library that represents tensors using an interleaved array, this will be a no-op, apart from changing the nominal rank of the tensor.

Dense layer

A dense or fully-connected layer is an old-school neural network layer. It consists of a single matrix multiplication, multiplying the input vector by the matrix of learned weights, then a vector addition of bias values. From our dense function:

vector<float> out(out_size, 0.f);

for (size_t i = 0; i < in_size; ++i) {
    for (size_t j = 0; j < out_size; ++j) {
        out[j] += weights[i][j] * in[i];
    }
}

for (size_t j = 0; j < out_size; ++j) {
    out[j] += biases[j];
}

Our network has two of these layers, the second of which produces the final prediction values.

That’s pretty much it for the code. There is a bit of boilerplate around each function, and some extra code to load in an image file using libpng, but that’s about all.

There is one other kind of layer that appears in the original Keras model: a dropout layer. This doesn’t appear in our code because it is only used while training. It discards a certain proportion of the values passed to it as input, something that can help to make the following layer more robust to error when trained.

Further notes

Precision and repeatability

Training a network is a process that involves randomness. A given model architecture can produce quite different results on separate training runs.

This is not the case when running a trained network to make a classification or prediction. Two implementations of a network that both use the same basic arithmetic datatypes (e.g. 32-bit floats) should produce the same results when given the same set of trained weights and biases. If you are trying to reproduce a network’s behaviour using a different library or framework, and you have the same trained weights to use with both implementations, and your results look only “sort of” similar, then they’re probably wrong.

The differences between implementations in terms of channel ordering and indexing can be problematic in this respect. Consider the convolution function with its seven nested for-loops. Load any of those weights with the wrong index or in the wrong order and of course it won’t work, and there are a lot of possible permutations there. There seems to be more than one opinion about how kernel weights should be indexed, for example, and it’s easy to get them upside-down if you’re converting from one framework to another.

Should I use code like this in production?

Probably not!

First and most trivially, this isn’t a very efficient arrangement of layers. Separating out the zero-padding and activation functions means more memory allocation and copying.

Second, there are many ways to speed up the expensive brute-force layers (that is, the convolution and dense layers) dramatically even on the CPU, using vectorisation, careful cache and memory management, and faster matrix and convolution algorithms. If you don’t take advantage of these, you’re wasting time for no purpose; if you do, you’re repeating work other people are doing in libraries already.

I adapted the example network to use the tiny-dnn header-only C++ library, and it ran roughly twice as fast as the code above. (I’m a sucker for libraries named tiny-something, even if, as in this case, they are no longer all that tiny.) That version of the example code can be found in the with-tiny-dnn directory in the repo. It’s ugly, because I had to load and convert the weights from a system that uses channels-last format into one that expects channels-first. But I only had to write that code once, and it wouldn’t have been necessary if I’d trained the model in a channels-first layout. I might also have overlooked some simpler way to do it.

Third, this approach is fragile in the face of changes to the model architecture, which have to be duplicated exactly across the model implementation and the separately-loaded weights. It would be preferable to load the model architecture into your program, rather than load the weights into a hand-written architecture. To this end it seems to make sense to use a library that supports importing and exporting models automatically.

All this said, I find it highly liberating to realise that one could just sketch out these few lines of code and have a working result.

 

Naming conventions in Standard ML

Many programming languages have a standard document that describes how to write and capitalise the names of functions, variables, and source files. It’s especially useful to have a standard for writing names made up from more than one word, where there are various options for how to join the words: “camel case”, which looks likeThis (with a capital letter “hump” in the middle), or “snake case”, which is underscore_separated.

I think Java in the mid-90s was the first really mainstream language to standardise file and variable naming conventions. The Java package mechanism requires files to be laid out in a particular way, and Sun published Java coding conventions which quickly became an effective standard for class and variable naming. Other languages followed. Python has had a standard that covers naming (PEP8) since 2001. More recent examples include Go and Swift.

Older languages tend to be less consistent. C++ is a mess: the standard library and most official example material uses snake_case for most names, but a great many developers, including those on most of the projects I’ve worked on, prefer camelCase, with capital initials for class names. File names are even more various: C++ source files are seen with .cpp, .cxx, .cc, and .C extensions; C++ header files with .h, .hpp, or no extension at all.

Standard ML (SML) is also a mess, and an interesting one because the language itself was standardised in 1990 and has been completely unchanged since the standard was revised in 1997. So although it is super-standardised, it’s a bit too old to have caught the wider shift in sentiment toward prescribing things like naming and file structure.

The SML standard is formal and very focused. It says nothing about coding style or naming, contains almost no examples using compound names, says nothing about filenames or file organisation, and specifies no way for one file to refer to another — the standard is indifferent to whether your source code is held in a file at all.

In trying to establish what naming conventions to use for my own code, I decided to look around at some existing libraries in SML to see what they had settled on.

The Basis library

SML has a standard library, the Basis library, which is a bit more recent than the language itself. Although it isn’t prescriptive, the library does use certain conventions itself and the introductory notes explain what they are. These cover only names of things within a program — not filenames, which are left up to the implementor of the standard. I’ll refer to them in the table below.

The Cornell style guide

Top search result for “SML naming conventions” for me is this online style guide for the Cornell CS312 course. It doesn’t cover file naming. Given the limited industry uptake for SML, an academic guide may be proportionately more influential than for other languages. I’ll mention this guide below as well.

Other code I looked at

I took a look at the following code:

  • The source of the MLton, MLKit, and SMLSharp compilers (excluding accompanying utility libraries)
  • The Basis library implementations shipped with MLton and SMLSharp
  • The SML/NJ extended library
  • The source of the Ur/Web language
  • The Ponyo library, an interesting fledgling effort to produce a broader base library than the Basis

In total, about 444,500 lines of code across 1790 SML source files. Some (presumably automatically-generated) source files are very long; while the mean file length is 248 lines including comments and blanks, the median is only 47.

Names within the language

The SML language has at least seven categories of things that need names: variables, type names, datatype constructors, exceptions, structures, signatures, and functors.

(By “variables” I really mean bindings, i.e. the vast majority of ordinary things with names: things that in a procedural language might include function names, variable names, and constant declarations. I’m using the word “variable” because it’s such a familiar everyday programming term.)

Source Variable Type name Datatype constructor Exception Structure Signature Functor
mlton variableName (mixed) DatatypeCtor ExceptionName* StructureName SIGNATURE_NAME FunctorName
mlkit (mixed) (mixed) DatatypeCtor* ExceptionName* StructureName SIGNATURE_NAME FunctorName
smlsharp variableName typeName* DATATYPE_CTOR* ExceptionName StructureName SIGNATURE_NAME FunctorName
basis variableName type_name DATATYPE_CTOR ExceptionName StructureName SIGNATURE_NAME FunctorName
smlnj-lib variableName type_name DATATYPE_CTOR ExceptionName StructureName SIGNATURE_NAME FunctorNameFn
urweb variableName type_name* DatatypeCtor ExceptionName StructureName SIGNATURE_NAME FunctorNameFn
ponyo variableName typeName DatatypeCtor ExceptionName Structure_Name SIGNATURE_NAME Functor_Name
cornell variableName type_name DatatypeCtor ExceptionName StructureName SIGNATURE_NAME FunctorName

* mostly

Here’s what I found, categorised into universal conventions, usual conventions, and “other”.

Universal

The following is the only universal convention:

Signature
SIGNATURE_NAME

The only code I found that doesn’t follow this convention is in the SML standard itself, which omits the underscore (like SIGNATURENAME).

Usual

The following conventions are not universal, but more popular than any other.

Variable Type name Exception Structure Functor
variableName type_name ExceptionName StructureName FunctorName

Camel case is clearly idiomatic for everything except type names. MLKit contains some snake-cased bindings as well, but none of the other libraries did. I like snake case in SML and I’ve written a fair bit of code using it myself; I hadn’t realised until now how uncommon it was. (It’s more common in SML’s sibling language OCaml. Ironic that, of the three very similar languages SML, OCaml, and F#, the only one not to use camel case is called OCaml.)

I spotted a handful of all-caps exception names and some camel case type names, but no library preferred those consistently.

The Ponyo library differs from the above for structures (Structure_Name) and functors (Functor_Name).

The SML/NJ library sort-of differs for functors, which are given a Fn suffix (FunctorNameFn). But you could think of this as part of the name, in which case the convention is the same.

Most type and datatype names used in public APIs are single words, or even single letters, so the convention often doesn’t matter for those.

Other

There seems to be no consensus about datatype constructors — I found DatatypeConstructor and DATATYPE_CONSTRUCTOR in roughly equal number.

Filenames

Nothing in the SML standard or Basis library cares about what source files are called, what file extension they use, or how you divide your code up among them. Some compilers might care, but most don’t. The business of telling the compiler which files a program consists of, or of expressing any relationships between files, is left up to external tools. SML has neither header files nor import directives.

This makes fertile ground for variety in naming schemes.

I’m going to consider only filenames that are associated with a primary structure, signature, or functor. Here’s the table.

Source Structure Signature Functor
mlton structure-name.sml signature-name.sig functor-name.fun
mlkit StructureName.sml SIGNATURE_NAME.sml* FunctorName.sml
smlsharp StructureName.sml SIGNATURE_NAME.sig* FunctorName.sml
mlton-basis structure-name.sml signature-name.sig functor-name.fun
smlsharp-basis StructureName.sml SIGNATURE_NAME.sig (none)
snlnj-lib structure-name.sml signature-name-sig.sml functor-name-fn.sml
urweb structure_name.sml signature_name.sig (n/a)
ponyo Structure_Name.sml SIGNATURE_NAME.ML Functor_Name.sml

* mostly

Clearly very inconsistent. There are no universal or usual conventions, only “other”.

Behind this there is a wider question about code organisation in files — should each signature live in its own file? Each structure? In many cases they do, but that is also far from universal.

If you use a scheme in which filenames are clearly derived from signature and structure names, does that mean you shouldn’t put more than one structure in the same file? What do you do with code that is not in any structure? Really it’s a pity to have to think about filenames at all, in a language that is so completely indifferent to file structure.

A Reasonable Recommendation

A plausible set of rules based on the above.

For names within the language:

Variable Type name Datatype constructor Exception Structure Signature Functor
variableName type_name DATATYPE_CTOR ExceptionName StructureName SIGNATURE_NAME FunctorName

This is the style used by the Basis library. Apart from datatype constructors, everything here was in the majority within the libraries I looked at.

For datatype constructors it seems reasonable to pick the most visible option and one that is consistent with the names in Basis. (This differs from the Cornell guide, however.) There is no confusion between these and signature names, because signature names never appear anywhere except in the declaration lines for those signatures and the structures that implement them.

For filenames:

Structure Signature Functor
structure-name.sml signature-name.sig functor-name.sml

The logic here is:

  • It’s still not a great idea to expect a case sensitive filesystem, so all-one-case is good
  • Generally use .sml extension for SML source
  • But the .sig extension for signatures seems very widely used, and it’s fair to make public signatures as easy to spot as possible
  • The .ml extension is not a great idea because it clashes with OCaml
  • The .fun extension used by MLton is a bit obscure, and you don’t always want to separate out functors (if you want to make functors more distinctive, give them names ending in Fn, as the SML/NJ library does).

 

Undergraduate programming languages

I read two quite different articles about programming in academia today.

I don’t know Yossi Kreinin, and when his piece Why bad scientific code beats code following “best practices” appeared on the Hacker News front page, I guessed that I probably wouldn’t agree with it. I’m a programmer working in academia who has spent some time trying to find ways to improve the code that researchers write, and I don’t want to be told I’m wasting my time.

But on the whole I agreed with him. A totally naive programmer is going to produce messy, organic, but basically linear code that will usually be easier to understand and work with than the code of someone who has learned a bit of Java and wants to put everything in a factory class. The part I don’t agree with of course, and that I suspect he put in just for the sake of the headline, is that these really are best practices.

I think that one of the hidden goals of a project like Software Carpentry is to teach that the reason software developers appear to be “too clever”, and somehow unreachable for the scientist outside the software field, is that they are being too clever. Programming gets over-complicated; you can learn to write good code (that is, readable code) more easily than you can learn to write bad code the professional programmers’ way.

I also identified with Yossi’s line that “I claim to have repented, mostly”. Most of the code I’ve written in my career is not very good, by this standard. It’s a long path to enlightenment.

The other article was by my friend Christophe Rhodes: What is a good language to teach at undergraduate level for Computing degrees?

A computing degree is an odd thing. Computer science is the theory of how computers and programs work. Computing, as a subject, is computer science plus some stuff about how we should actually program them in order to get a job done. The two are very different.

Christophe breaks down objectives of a computing degree: Think, experiment, job, career, study, society. I’m not very familiar with the study objective, which refers to postgraduate study in a computing department (something I never did). Think refers to teaching students how to think and solve problems computationally. I think he may be missing a step, and it may be useful to separate thinking about how the computer works (“understand”?) from thinking about how the programmer can work.

There is also, of course, the question of how to avoid making your programmer feel too clever.

As an undergraduate in 1990-94, I was taught, in this order, Prolog, C, ML, 68k assembly language, and C++. I also learned some Lisp, though I can’t remember being formally taught it. Prolog, C, and the assembly language were all good bases for what I referred to as the understand objective, getting something out of the history of computing and learning about how computers work. ML was a wonderful introduction to thinking, and it’s no coincidence that I’ve recently been programming in an ML variant as an engaging alternative to work.

The hard one to evaluate is C++. It was a very poor teaching language for object-oriented programming, which is a pity because at the time we learned it, I didn’t get object-oriented programming at all. And we learned nothing about any other kind of programming from it, having already been taught three different high-level languages. So it contributed very poorly to the think objective and not at all to the understand one. It’s an awful language to learn to write clear, reliable code in, therefore bad from the society perspective, and you can (as I have) spend 20 years learning to write it, which is obviously ridiculous, hence you would think also bad from the job perspective.

But it turns out that C++ has been the most valuable job language there could be, because (a) it is resiliently portable: it turns out that write-once-run-anywhere with a virtual machine comes and goes in waves, depending on the whim of the top operating system provider of the time, but compiling to machine code seems to be eternal; (b) it is just about able to ape a number of programming paradigms, so you can get away with adopting a style without adopting another language; and (c) its complexity means that it looks good on your CV. I honestly wish, as a 20+-year C++ programmer, that we could make it so that nobody ever had to learn C++ again, but even now I don’t think that is the case.

The one goal completely unaddressed by my own degree, and almost every language I’ve learned since, is experiment. I think there are two essentials for this: a responsive interactive interpreter, and visual responsiveness, like a graphical scene environment or plotting tools. Most of the things you want to do here are probably growing up in Javascript and HTML5.

I don’t really know anything about teaching, about the actual on-the-ground business of making people learn stuff. So you shouldn’t listen to me on this next bit, I’m just daydreaming. But if I were planning a 3-year computing degree course in my head now, I think I would aim to teach, in this order:

  • Python, for the basics of procedural computing. It’s the cleanest language for doing satisfying simple loops and input/output transformations, and is a sound general-purpose language.
  • Clojure. I’ve decided now that I don’t so much care for Lisp syntax day-to-day, but a Lisp gives you so much to talk about and investigate without getting lost in the specific requirements of the language. And a Lisp on the JVM gives you more depth: advanced students could learn quite a lot about Java without ever actually being taught Java.
  • Some assembly language, probably ARM and preferably with a nice visible bit of circuit board on hand.
  • Javascript, but not aiming to teach the language so much as using a specific interactive framework and with a specific small game-development project to complete. I would probably also use Typescript rather than untyped Javascript.
  • Haskell. As an ML guy it was always my enemy, but Haskell is the functional language that has endured. It’s a follow-on course from Lisp.
  • C++. Because, as far as I can see at the moment, you pretty much have to. But please, give them modern-style pointer ownership and RAII (i.e. avoiding explicit heap allocation).
  • Python again, in a closing course that taught people how best to do things in an actual working environment. Testing, not being too much of a smartarse, etc.

But the odd thing about that set of languages is that, just as with my own degree, you never get a very good dedicated object-oriented language. Maybe Objective-C could replace C++; it’s probably a clearer pedagogical object language, but C++ is everywhere while Objective-C is effectively platform-specific, even if it is a very popular platform.

And be sure to teach them version control.

Oh, and don’t forget to add a double-entry book-keeping course. It’s probably more useful than the programming stuff.