Standard ML and how I’m compiling it

I mentioned in an earlier post that I was starting to use Standard ML for a (modest) real project. An early problem I encountered was how to manage builds, when using third-party library modules and multiple files of my own code.

I’m not talking here about anything advanced; I don’t even care yet about incremental compilation or packaging source for other people to build. My requirements are:

  1. produce an efficient native binary
  2. but get error reports quickly, for a fast build cycle
  3. and make it possible to load my code into a REPL for interactive experiments
  4. while using a small amount of third-party library code.

Several implementations of Standard ML exist (I listed some in my last post) and they’re broadly compatible at the core language level, but they disagree on how to compile things.

There’s little consensus between implementations about how to describe module dependencies, pull in third-party libraries, or compile a program that consists of more than one file. The standard says nothing about how different source files interact. There’s no include or import directive and no link between filesystem and module naming. Some implementations do extend the standard (most notably SML/NJ adds a few things) but there isn’t any standard way for a program to detect what language features are available to it. Implementations differ in purpose as well: some SML systems are primarily interactive environments, others primarily compilers.

So here’s what I found myself doing. I hope it might be useful to someone. But before that, I hope that somebody will post a comment that suggests a better way and makes the whole post obsolete.

My general principle was to set up a “default build” that used the MLton compiler, because that has the best combination of standards compliance and generating fast binaries; but then hack up a script to make the build also possible with other compilers, because MLton is slow to run and provides no REPL so is less useful during development.

Let’s build that up from a simple program, starting with Hello World. (I’m using Poly/ML as the “other” compiler—I’ll explain why later on.)

What compilers expect

I’m going to use a function for my Hello World, rather than just calling print at the top level. It’ll make things clearer in a moment. Like any effectively no-argument function in SML, it takes a single argument of unit type, ().

(* hello.sml *)
 
fun hello () =
    print "Hello, world!\n"

val _ = hello ()

And to compile it:

$ mlton hello.sml
$ ./hello
Hello, world!
$

When MLton compiles a program, it produces an executable that will evaluate the ML source you provide. There is no “main” function in the ML itself; instead the program will evaluate the bindings that appear at top level—in this case the final call to hello.

Compiling with MLton is not fast. Hello World takes more than two seconds to build, which is beyond the “immediate” threshold you need for a good compile/edit feedback loop.

Poly/ML compiles much faster, but it doesn’t like this code:

$ polyc hello.sml
Hello, world!
Error-Value or constructor (main) has not been declared
Found near PolyML.export ("/tmp/polyobj.15826.o", main)
Static Errors
$

The limitations of the Standard in Standard ML are quickly reached! It doesn’t define any kind of entry point for a compiled program.

Most SML compilers—including the Poly/ML compiler, but not MLton—work by reading the program into the interactive environment and then dumping it out as object code. Anything evaluated at top-level in the program gets executed in the interactive environment while reading the code in, rather than being exposed as an entry point in the resulting executable. We can see this happening in our example, as Poly/ML prints out “Hello, world!” before the compile error.

Instead Poly/ML expects to have a separate function called main, which will be the entry point for the program:

(* hello.sml *)
 
fun hello () =
    print "Hello, world!\n"

fun main () = hello ()

That works…

$ polyc hello.sml
$ ./a.out
Hello, world!
$

… and with Poly/ML it only takes a small fraction of a second to compile. But now it won’t work with MLton:

$ mlton hello.sml
$ ./hello
$

The main function is never called. We’re going to need to do something different for the two compilers, using a two-file setup something like this:

(* hello.sml *)

fun hello () =
    print "Hello, world!\n"

fun main () = hello ()
(* main.sml *)

val _ = main ()

Then when using MLton, we compile both hello.sml and main.sml; when using Poly/ML, we compile only hello.sml. In both cases we will end up with a single native binary executable that calls the hello function when invoked.

Using Basis files to drive the compiler

So now we have two files instead of one. How do we compile two files?

MLton’s multi-file compilation is driven by what it calls Basis files. These have an extension of .mlb, and in their simplest form just consist of a list of .sml filenames which are evaluated one after another into a single environment:

(* hello.mlb *)
hello.sml
main.sml

This file format is specific to the MLton compiler, it’s not part of the SML language.

The above example isn’t quite enough for a complete build with MLton—whereas the compiler automatically introduces the standard Basis library into scope when building a single .sml file, it doesn’t do so when building from a .mlb description. So any functions we use from the standard Basis library (such as print) will be missing.

We need to add a line at the start to include the standard library:

(* hello.mlb *)
$(SML_LIB)/basis/basis.mlb
hello.sml
main.sml

and then

$ mlton hello.mlb 
$ ./hello 
Hello, world!
$

This is simple enough so far. It doesn’t work directly with Poly/ML, because the polyc compiler doesn’t support the .mlb format. But polyc is itself just a shell script that loads a .sml file into Poly/ML and exports out the results. So we can make our own script that reads filenames from a simple .mlb file (omitting the one that requests the MLton-specific Basis library, as this is automatically loaded by Poly/ML).

You can find such a script here. At the moment I’m saving this in the project directory as a file named polybuild, so:

$ ./polybuild hello.mlb
$ ./hello
Hello, world!
$

The main.sml file gives no indication of where the main function it calls might live. The question of how to organise SML files together into an application is left completely outside the scope of the SML standard.

Now what if we introduce a dependency on a third-party library other than the standard one?

Third-party library dependencies

I was surprised to find no associative container type in the Standard ML Basis library—these containers have many applications and are often built in to languages nowadays. Let’s consider introducing one of them.

It happens that MLton (but not Poly/ML) ships with a port of the SML/NJ library which includes a couple of associative map containers. Here’s a little program that uses one:

(* dict.sml *)

structure Dict = SplayMapFn (struct
    type ord_key = string
    val compare = String.compare
    end)

val dict =
    let val d = Dict.empty
        val d = Dict.insert (d, "eggs", 5)
        val d = Dict.insert (d, "bacon", 2)
        val d = Dict.insert (d, "tomatoes", 3)
    in d
    end

fun main () =
    print ("Dictionary contains " ^
           (Int.toString (Dict.numItems dict)) ^ " items\n")

As before, the source code contains no indication of where SplayMapFn is to be found. (It’s a functor in the SML/NJ library—this is a type of function that, in this case, converts the generic map structure into a structure specialised for a key of type string, a bit like a template instantiation in C++. “Functor” is one of those terms that means something different in every language that uses it.)

Here’s a MLton-compatible .mlb file for this, in which we resolve the question of where to find our map type:

(* dict.mlb *)

$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/smlnj-lib/Util/smlnj-lib.mlb
dict.sml
main.sml
$ mlton dict.mlb
$ ./dict
Dictionary contains 3 items
$

This doesn’t work with our polybuild script, as it includes another .mlb file and we have only dealt with files that include .sml files. We can’t fix this by just reading in the .mlb file it includes, because (you’ll see this if you have a look at the smlnj-lib.mlb file being included) it isn’t just a list of source files—the MLton developers have worked it carefully with a more advanced syntax to make sure no extraneous names are exported.

It’s still possible to concoct a Basis file for our program that refers only to other .sml files (and the Basis library). We just start by adding a reference to the pivotal file from the library that we are calling into, in this case splay-map-fn.sml, and then when we run the build and get an error, we add another file for each undefined symbol. The end result is this:

(* dict.mlb *)
$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/smlnj-lib/Util/lib-base-sig.sml
$(SML_LIB)/smlnj-lib/Util/lib-base.sml
$(SML_LIB)/smlnj-lib/Util/ord-key-sig.sml
$(SML_LIB)/smlnj-lib/Util/ord-map-sig.sml
$(SML_LIB)/smlnj-lib/Util/splaytree-sig.sml
$(SML_LIB)/smlnj-lib/Util/splaytree.sml
$(SML_LIB)/smlnj-lib/Util/splay-map-fn.sml
dict.sml
main.sml

I fear problems with bigger builds, but this does work with our polybuild script.

As a way to introduce a simple associative array into a program, this looks a bit crazy. Every modern language has a map or hash type either built in to the language, or available with minimal syntax in a standard library. This program “should be” a two-liner with a one-line compiler invocation.

I do like the feeling of building my program from nice uniform bricks. Should I want to replace a container implementation with a different one, it’s clear how I would do so and it wouldn’t involve changing the calling code or losing type safety—that does feel good.

But such overhead, for something as basic as this: we’ve just got to hope that it works out well in the end, as we compose structures and the scale gets bigger. Will it?

Interaction with the REPL

Along with the polybuild script, I made a script called polyrepl that starts the Poly/ML interactive environment (remember MLton doesn’t have one) and loads the contents of an .mlb file into it for further investigations. You can find that one in the same repo, here.

Notes on other compilers

(I summarised some of the Standard ML compilers available in my previous post.)

  • MLKit supports .mlb files directly, but it doesn’t understand the extended syntax used in the MLton smlnj-lib.mlb, and it doesn’t seem to like compiling files found in locations it doesn’t have write permission for (such as system directories).
  • Moscow ML doesn’t support .mlb files directly, and appears to be harder to use existing module code with than Poly/ML because it seems to enforce structure/signature separation (signatures have to be separated out into .sig files, I think).
  • SML/NJ has its own compilation manager (called Compilation Manager) which seems rather complex, and I don’t really want to go there at the moment