Repoint: A manager for checkouts of third-party source code dependencies

I’ve just tagged v1.0 of Repoint, a tool for managing library source code in a development project. Conceptually it sits somewhere between Mercurial/Git submodules and a package manager like npm. It is intended for use with languages or environments that don’t have a favoured package manager, or in situations where the dependent libraries themselves aren’t aware that they are being package-managed. Essentially, situations where you want, or need, to be a bit hands-off from any actual package manager. I use it for projects in C++ and SML among other things.

Like npm, Bundler, Composer etc., Repoint refers to a project spec file that you provide that lists the libraries you want to bring in to your project directory (and which are brought in to the project directory, not installed to a central location). Like them, it creates a lock file to record the versions that were actually installed, which you can commit for repeatable builds. But unlike npm et al, all Repoint actually does is clone from the libraries’ upstream repository URLs into a subdirectory of the project directory, just as happens with submodules, and then report accurately on their status compared with their upstream repositories later.

The expected deployment of Repoint consists of copying the Repoint files into the project directory, committing them along with everything else, and running Repoint from there, in the manner of a configure script — so that developers generally don’t have to install it either. It’s portable and it works the same on Linux, macOS, or Windows. Things are not always quite that simple, but most of the time they’re close.

At its simplest, Repoint just checks stuff out from Git or whatever for you, which doesn’t look very exciting. An example on Windows:

[Screenshot: Repoint running on Windows]

Simple though Repoint’s basic usage is, it can run things pretty rigorously across its three supported version-control systems (git, hg, svn), it gets a lot of annoying corner cases right, and it is solid, reliable, and well-tested across platforms. The README has more documentation, including of some more advanced features.

Is this of any use to me?

Repoint might be relevant to your project if all of the following apply:

  • You are developing with a programming language or environment that has no obvious single answer to the “what package manager should I use?” question; and
  • Your code project depends on one or more external libraries that are published in source form through public version-control URLs; and
  • You can’t assume that a person compiling your code has those libraries installed already; and
  • You don’t want to copy the libraries into your own version-control repo to form a Giant Monorepo; and
  • Most of your dependent libraries do not similarly depend on other libraries (Repoint doesn’t support recursive dependencies at all).

Beyond mere relevance, Repoint might be actively useful to your project if any of the following also apply:

  • The libraries you’re using are published through a mixture of version-control systems, e.g. some use Git but others Mercurial or Subversion; or
  • The libraries you’re using and, possibly, your own project might change from one version-control system to another at some point in the future.

See the README for more caveats and general documentation.

Example

The biggest current example of a project using Repoint is Sonic Visualiser. If you check out its code from GitHub or from the SoundSoftware code site and run its configure script, it will call out to repoint install to get the necessary dependencies. (On platforms that don’t use the configure script, you have to run Repoint yourself.)

Note that if you download a Sonic Visualiser source code tarball, there is no reference to Repoint in it and the Repoint script is never run — Repoint is very much an active-developer tool, and it includes an archive function that bundles up all the dependent libraries into a tarball so that people building or deploying the end result aren’t burdened with any additional utilities to use.

I also use Repoint in various smaller projects. If you’re browsing around looking at them, note that it wasn’t originally called Repoint — its working title in earlier versions was vext and I haven’t quite finished switching the repos over. Those earlier versions work fine of course, they just use different words.

 

Naming conventions in Standard ML

Many programming languages have a standard document that describes how to write and capitalise the names of functions, variables, and source files. It’s especially useful to have a standard for writing names made up from more than one word, where there are various options for how to join the words: “camel case”, which looks likeThis (with a capital letter “hump” in the middle), or “snake case”, which is underscore_separated.

I think Java in the mid-90s was the first really mainstream language to standardise file and variable naming conventions. The Java package mechanism requires files to be laid out in a particular way, and Sun published Java coding conventions which quickly became an effective standard for class and variable naming. Other languages followed. Python has had a standard that covers naming (PEP 8) since 2001. More recent examples include Go and Swift.

Older languages tend to be less consistent. C++ is a mess: the standard library and most official example material use snake_case for most names, but a great many developers, including those on most of the projects I’ve worked on, prefer camelCase, with capital initials for class names. File names are even more various: C++ source files are seen with .cpp, .cxx, .cc, and .C extensions; C++ header files with .h, .hpp, or no extension at all.

Standard ML (SML) is also a mess, and an interesting one because the language itself was standardised in 1990 and has been completely unchanged since the standard was revised in 1997. So although it is super-standardised, it’s a bit too old to have caught the wider shift in sentiment toward prescribing things like naming and file structure.

The SML standard is formal and very focused. It says nothing about coding style or naming, contains almost no examples using compound names, says nothing about filenames or file organisation, and specifies no way for one file to refer to another — the standard is indifferent to whether your source code is held in a file at all.

In trying to establish what naming conventions to use for my own code, I decided to look around at some existing libraries in SML to see what they had settled on.

The Basis library

SML has a standard library, the Basis library, which is a bit more recent than the language itself. Although it isn’t prescriptive, the library does use certain conventions itself and the introductory notes explain what they are. These cover only names of things within a program — not filenames, which are left up to the implementor of the standard. I’ll refer to them in the table below.

The Cornell style guide

Top search result for “SML naming conventions” for me is this online style guide for the Cornell CS312 course. It doesn’t cover file naming. Given the limited industry uptake for SML, an academic guide may be proportionately more influential than for other languages. I’ll mention this guide below as well.

Other code I looked at

I took a look at the following code:

  • The source of the MLton, MLKit, and SMLSharp compilers (excluding accompanying utility libraries)
  • The Basis library implementations shipped with MLton and SMLSharp
  • The SML/NJ extended library
  • The source of the Ur/Web language
  • The Ponyo library, an interesting fledgling effort to produce a broader base library than the Basis

In total, about 444,500 lines of code across 1790 SML source files. Some (presumably automatically-generated) source files are very long; while the mean file length is 248 lines including comments and blanks, the median is only 47.

Names within the language

The SML language has at least seven categories of things that need names: variables, type names, datatype constructors, exceptions, structures, signatures, and functors.

(By “variables” I really mean bindings, i.e. the vast majority of ordinary things with names: things that in a procedural language might include function names, variable names, and constant declarations. I’m using the word “variable” because it’s such a familiar everyday programming term.)

Source     Variable      Type name   Datatype constructor  Exception       Structure       Signature       Functor
mlton      variableName  (mixed)     DatatypeCtor          ExceptionName*  StructureName   SIGNATURE_NAME  FunctorName
mlkit      (mixed)       (mixed)     DatatypeCtor*         ExceptionName*  StructureName   SIGNATURE_NAME  FunctorName
smlsharp   variableName  typeName*   DATATYPE_CTOR*        ExceptionName   StructureName   SIGNATURE_NAME  FunctorName
basis      variableName  type_name   DATATYPE_CTOR         ExceptionName   StructureName   SIGNATURE_NAME  FunctorName
smlnj-lib  variableName  type_name   DATATYPE_CTOR         ExceptionName   StructureName   SIGNATURE_NAME  FunctorNameFn
urweb      variableName  type_name*  DatatypeCtor          ExceptionName   StructureName   SIGNATURE_NAME  FunctorNameFn
ponyo      variableName  typeName    DatatypeCtor          ExceptionName   Structure_Name  SIGNATURE_NAME  Functor_Name
cornell    variableName  type_name   DatatypeCtor          ExceptionName   StructureName   SIGNATURE_NAME  FunctorName

* mostly

Here’s what I found, categorised into universal conventions, usual conventions, and “other”.

Universal

The following is the only universal convention:

Signature
SIGNATURE_NAME

The only code I found that doesn’t follow this convention is in the SML standard itself, which omits the underscore (like SIGNATURENAME).

Usual

The following conventions are not universal, but more popular than any other.

Variable      Type name  Exception      Structure      Functor
variableName  type_name  ExceptionName  StructureName  FunctorName

Camel case is clearly idiomatic for everything except type names. MLKit contains some snake-cased bindings as well, but none of the other libraries did. I like snake case in SML and I’ve written a fair bit of code using it myself; I hadn’t realised until now how uncommon it was. (It’s more common in SML’s sibling language OCaml. Ironic that, of the three very similar languages SML, OCaml, and F#, the only one not to use camel case is called OCaml.)

I spotted a handful of all-caps exception names and some camel case type names, but no library preferred those consistently.

The Ponyo library differs from the above for structures (Structure_Name) and functors (Functor_Name).

The SML/NJ library sort-of differs for functors, which are given a Fn suffix (FunctorNameFn). But you could think of this as part of the name, in which case the convention is the same.

Most type and datatype names used in public APIs are single words, or even single letters, so the convention often doesn’t matter for those.

Other

There seems to be no consensus about datatype constructors — I found DatatypeConstructor and DATATYPE_CONSTRUCTOR in roughly equal number.

Filenames

Nothing in the SML standard or Basis library cares about what source files are called, what file extension they use, or how you divide your code up among them. Some compilers might care, but most don’t. The business of telling the compiler which files a program consists of, or of expressing any relationships between files, is left up to external tools. SML has neither header files nor import directives.

This makes fertile ground for variety in naming schemes.

I’m going to consider only filenames that are associated with a primary structure, signature, or functor. Here’s the table.

Source          Structure           Signature               Functor
mlton           structure-name.sml  signature-name.sig      functor-name.fun
mlkit           StructureName.sml   SIGNATURE_NAME.sml*     FunctorName.sml
smlsharp        StructureName.sml   SIGNATURE_NAME.sig*     FunctorName.sml
mlton-basis     structure-name.sml  signature-name.sig      functor-name.fun
smlsharp-basis  StructureName.sml   SIGNATURE_NAME.sig      (none)
smlnj-lib       structure-name.sml  signature-name-sig.sml  functor-name-fn.sml
urweb           structure_name.sml  signature_name.sig      (n/a)
ponyo           Structure_Name.sml  SIGNATURE_NAME.ML       Functor_Name.sml

* mostly

Clearly very inconsistent. There are no universal or usual conventions, only “other”.

Behind this there is a wider question about code organisation in files — should each signature live in its own file? Each structure? In many cases they do, but that is also far from universal.

If you use a scheme in which filenames are clearly derived from signature and structure names, does that mean you shouldn’t put more than one structure in the same file? What do you do with code that is not in any structure? Really it’s a pity to have to think about filenames at all, in a language that is so completely indifferent to file structure.

A Reasonable Recommendation

A plausible set of rules based on the above.

For names within the language:

Variable      Type name  Datatype constructor  Exception      Structure      Signature       Functor
variableName  type_name  DATATYPE_CTOR         ExceptionName  StructureName  SIGNATURE_NAME  FunctorName

This is the style used by the Basis library. Apart from datatype constructors, everything here was in the majority within the libraries I looked at.

For datatype constructors it seems reasonable to pick the most visible option and one that is consistent with the names in Basis. (This differs from the Cornell guide, however.) There is no confusion between these and signature names, because signature names never appear anywhere except in the declaration lines for those signatures and the structures that implement them.
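To make those rules concrete, here is a small made-up module written in that style. Every name in it (ITEM_COUNTER, ItemCounter, DoubleCounter and so on) is invented purely for illustration and doesn't come from any of the libraries surveyed.

```sml
(* Signature: SIGNATURE_NAME *)
signature ITEM_COUNTER = sig
    type item_count                                        (* type name: type_name *)
    datatype count_result = NO_ITEMS | SOME_ITEMS of int   (* constructors: DATATYPE_CTOR *)
    exception NegativeCount                                (* exception: ExceptionName *)
    val emptyCount : item_count                            (* variables: variableName *)
    val addItems : item_count * int -> item_count
    val countResult : item_count -> count_result
end

(* Structure: StructureName *)
structure ItemCounter :> ITEM_COUNTER = struct
    type item_count = int
    datatype count_result = NO_ITEMS | SOME_ITEMS of int
    exception NegativeCount
    val emptyCount = 0
    fun addItems (count, n) =
        if n < 0 then raise NegativeCount else count + n
    fun countResult 0 = NO_ITEMS
      | countResult n = SOME_ITEMS n
end

(* Functor: FunctorName *)
functor DoubleCounter (Counter : ITEM_COUNTER) = struct
    fun addTwice (count, n) =
        Counter.addItems (Counter.addItems (count, n), n)
end

structure DoubleItemCounter = DoubleCounter (ItemCounter)
```

Under the filename rules that follow, the signature here would presumably live in item-counter.sig, the structure in item-counter.sml, and the functor in double-counter.sml.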

For filenames:

Structure           Signature           Functor
structure-name.sml  signature-name.sig  functor-name.sml

The logic here is:

  • It’s still not a great idea to expect a case-sensitive filesystem, so all-one-case is good
  • Generally use .sml extension for SML source
  • But the .sig extension for signatures seems very widely used, and it’s fair to make public signatures as easy to spot as possible
  • The .ml extension is not a great idea because it clashes with OCaml
  • The .fun extension used by MLton is a bit obscure, and you don’t always want to separate out functors (if you want to make functors more distinctive, give them names ending in Fn, as the SML/NJ library does).

 

F♯ has possibilities

A couple of months ago, Microsoft announced that they were buying a company called Xamarin, co-founded by the admirable Miguel “you can now flame me, I am full of love” de Icaza. (No sarcasm — I think Miguel is terrific, and the delightfully positive email linked above really stuck with me; if only I could have that attitude more often.)

As I understand it, Xamarin makes

  1. the Mono runtime, a portable third-party implementation of Microsoft’s .NET runtime for the C# and F# programming languages
  2. the eponymous Xamarin frameworks, which can be used with .NET to develop mobile apps for iOS and Android
  3. plugins for the Visual Studio IDE on Windows and the MonoDevelop IDE on OS X to support mobile platform builds using Xamarin (the MonoDevelop-plus-plugins combo is known as Xamarin Studio).

Then a couple of days ago, the newly-acquired Xamarin declared

  1. that the Mono runtime was switching from LGPL/GPL licenses to MIT, allowing no-cost use in commercial applications
  2. that Microsoft were providing a patent promise (which I have not closely read) to remove concerns for commercial users of Mono
  3. that the Xamarin frameworks for iOS and Android, and the IDE plugins, were now free (of cost)
  4. that at some future point the Xamarin frameworks would be open sourced

I’m trying to unpick exactly what this could mean to me.

According to this discussion on Hacker News, the IDE plugins are remaining proprietary (which appears to mean that no IDE on Linux will be supported, since the IDE plugins are not currently available for Linux), but “the Xamarin runtime and all the commandline tools you need to build apps” will be open sourced.

What this means

as I understand it,

  • Developers working on proprietary .NET applications will be able to build and release versions for other platforms than Windows, using Mono, at no extra cost
  • Developers working on open source .NET applications will be able to publish the ensemble with Mono under the MIT license if desired and will (apparently) be free of patent concerns
  • Developers will be able to make both proprietary and open source .NET applications for iOS and Android at no cost using Windows and OS X
  • There is a possibility of being able to do builds of the above using Linux as well once the SDK is open, though probably without an IDE

Unrelatedly, there are separate projects afoot to provide native code and to-Javascript compilers for .NET bytecode.

What I’m interested in

I do a range of programming including a mixture of signal-processing and UI work, and am interested in exploring comprehensible, straightforward functional languages in the ML family (I wrote a little post about that here). Unlike many audio developers I have relatively limited demands on real-time response, but everything I write really wants to be cross-platform, because I’ve got specialised users on pretty much every common platform and I have limited time and funding. (I understand that cross-platform apps are often inferior to single-platform apps, but they’re better than no apps.)

Xamarin doesn’t quite meet my expectations because it’s not really a cross-platform framework in the manner of Qt (which I use) or JUCE (which is widely used by others in my field). Instead of providing a common “widget set” across all platforms, Xamarin provides a separate thin interface to the native UI logic for each platform. It’s hard to judge how much more work this is, without knowing where the abstraction boundaries lie, but it may be a more relevant and sensible distinction on mobile platforms (where the differences are often in interaction and layout) than desktops (where the differences are mostly about how large numbers of individual widgets look).

An ideal combination of language and framework for me goes something like

  • strongly-typed, mostly functional, mostly immutable data structures
  • efficient unboxed support for floating-point vector types, including SIMD support
  • simple syntax (SML is nice)
  • low-cost foreign-function interface for C integration
  • high-level approach to multithreading
  • can work with gross UI layout in HTML5 (possibly DOM-update reactive UI style?)
  • good libraries for e.g. audio file I/O, signal processing, matrix algebra
  • can develop on Linux and deploy to all of Linux, Windows, OS X, iOS, Android
  • free (or cheap, for proprietary apps) and open source (for open source apps)
  • has indenting Emacs mode

Where F# appears to score

F#, Microsoft’s ML-derived functional language for the .NET CLR, hits several of these. It has the typing, mostly-functional style, syntax, FFI, multithreading, libraries, deployment and licensing, and potentially the development platform (if the open source Xamarin framework should lead to the ability to build mobile apps directly from Linux).

I’m not sure about floating-point and vectors or about reusable HTML-style UI. I’d like to make the time to do another comparison of some ML-family languages, focusing on DSP-style float activity and on threading. I’ve done a bit of related work in Standard ML, which I could use as a basis for comparison.

Unless and until I get to do that, I’d love to hear any thoughts about F# as a general-purpose DSP-and-UI language, for a developer whose home platform is Linux.

My impression from the feedback on my earlier post was that the F# community is both enthusiastic and polite, and I notice that F# is the third most-loved language in Stack Overflow’s 2016 survey. Imagine a language that is useful no matter what platform you’re targeting, and whose developers love it. I can hope.

 

Fold: at the limit of comprehension

“Fold” is a programming concept, a common name for a particular higher-order function that is widely used in functional programming languages. It’s a fairly simple thing, but in practice I think of it as representing the outer limit of concepts a normal programmer can reasonably be expected to grasp in day-to-day work.

What is fold? Fold is an elementary function for situations where you need to keep a tally of things. If you have a list of numbers and you want to tally them up in some way, for example to add them together, fold will do that.

Fold is also good at transforming sequences of things, and it can be used to reverse a list or modify each element of a sequence.

Fold is a useful fundamental function, and it’s widely used. I like using it. I just scanned about 440,000 lines of code (my own and other people’s) in ML-family languages and found about 14,000 that either called or defined a fold function.

Let me try to describe fold more precisely in English: It acts upon some sort of iterable object or container. It takes another function as an argument, one that the caller provides, and it calls that function repeatedly, providing it with one of the elements of the container each time, in order, as well as some sort of accumulator value. That function is expected to return an updated version of the accumulator each time it’s called, and that updated version gets passed in to the next call. Having called that function for every element, fold then returns the final value of the accumulator.

I tried, but I think that’s quite hard to follow. Examples are easier. Let’s add a list of numbers in Standard ML, by folding with the “+” function and an accumulator that starts at zero.

> val numbers = [1,2,3,4,5];
val numbers = [1, 2, 3, 4, 5]: int list
> foldl (op+) 0 numbers;
val it = 15: int
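As a rough sketch of the English description above, a left fold over a list could be written by hand like this. The name myFoldl is invented for the example; the Basis library's own foldl behaves the same way on lists, but this is not a copy of any particular implementation.

```sml
(* A hand-written left fold over lists. The caller's function f receives
   each element together with the current accumulator, and returns the
   updated accumulator that is passed on to the next call. *)
fun myFoldl f acc [] = acc
  | myFoldl f acc (x::xs) = myFoldl f (f (x, acc)) xs

(* Tallying: add up a list of numbers, starting from zero. *)
val total = myFoldl (op +) 0 [1, 2, 3, 4, 5]         (* 15 *)

(* Transforming: push each element onto the front of a new list,
   which reverses the original. *)
val reversed = myFoldl (op ::) [] ["a", "b", "c"]    (* ["c", "b", "a"] *)
```

The same two-line function does both jobs; only the callback and the starting accumulator change.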

What’s difficult about fold?

  1. Fold is conceptually tricky because it’s such a general higher-order function. It captures a simple procedure that is common to a lot of actions that we are used to thinking of as distinct. For example, it can be used to add up a list of numbers, reverse a list of strings, increase all of the numbers in a sequence, calculate a ranking score for the set of webpages containing a search term, etc. These aren’t things that we habitually think of as similar actions, other than that they happen to involve a list or set of something. Especially, we aren’t used to giving a name to the general procedure involved and then treating individual activities of that type as specialisations of it. This is often a problem with higher-order functions (and let’s not go into monads).
  2. Fold is syntactically tricky, and its function type is confusing because there is no obvious logic determining either the order of arguments given to fold or the order of arguments accepted by the function you pass to it. I must have written hundreds of calls to fold, but I still hesitate each time to recall which order the arguments go in. Not surprising, since the argument order for the callback function differs between different languages’ libraries: some take the accumulator first and value second, others the other way around.
  3. Fold has several different names (some languages and libraries call it reduce, or inject) and none of them suggests any common English word for any of the actions it is actually used for. I suppose that’s because of point 1: we don’t name the general procedure. Fold is perhaps a marginally worse name than reduce or inject, but it’s still probably the most common.
  4. There’s more than one version of fold. Verity Stob cheekily asks “Do you fold to left or to the right? Do not provide too much information.” Left and right fold differ in the order in which they iterate through the container, so they usually produce different results, but there can also be profound differences between them in terms of performance and computability, especially when using lazy evaluation. This means you probably do have to know which is which. (See footnote below.)
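A small sketch of points 2 and 4, using the foldl and foldr functions from the Standard ML Basis library. In that library the callback takes a pair with the element first and the accumulator second; the value names here are made up for the example.

```sml
(* Same callback, same starting value, same list: only the direction differs. *)
val leftJoined  = foldl (fn (x, acc) => x ^ acc) "" ["a", "b", "c"]   (* "cba" *)
val rightJoined = foldr (fn (x, acc) => x ^ acc) "" ["a", "b", "c"]   (* "abc" *)

(* Folding with cons shows the direction directly. *)
val leftConsed  = foldl (op ::) [] [1, 2, 3]   (* [3, 2, 1] *)
val rightConsed = foldr (op ::) [] [1, 2, 3]   (* [1, 2, 3] *)

(* Modifying each element while keeping the order needs the right fold. *)
val incremented = foldr (fn (x, acc) => (x + 1) :: acc) [] [1, 2, 3]  (* [2, 3, 4] *)
```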

A post about fold by James Hague a few years ago asked, “Is the difficulty many programmers have in grasping functional programming inherent in the basic concept of non-destructively operating on values, or is it in the popular abstractions that have been built-up to describe functional programming?” In this case I think it’s both. Fold is a good example of syntax failing us, and I think it’s also inherently a difficult abstraction to recognise (i.e. to spot the function application common to each activity). Fold is a fundamental operation in much of functional programming, but it doesn’t really feel like one because the abstraction is not comfortable. But besides that, many of the things fold is useful for are things that we would usually visualise in destructive terms: update the tally, push something onto the front of the list.

In Python the fold function (which Python calls reduce) was dropped from the built-in functions and moved into a separate module for Python 3. Guido van Rossum wrote, “apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what’s actually being fed into that function before I understand what the reduce() is supposed to do.” Instead the Python style for these activities usually involves destructively updating the accumulator.

Functional programming will surely never be really mainstream so long as fold appears in basic tutorials for it. Though in practice at least, because it’s such a general function, it can often be usefully hidden behind a more discoverable domain-specific API.

***

(Footnote. You can tell whether an implementation of fold is a left or right fold by applying it to the list “cons” function, which is often called “::”. If this reverses a list passed to it, you have a left fold. For example, the language Yeti has a function simply called fold; which is it? —

> fold (flip (::)) [] [1,2,3,4];
[4,3,2,1] is list<number>

So it’s a left fold.)

 

… and an FFT in Standard ML

While writing my earlier post on Javascript FFTs, I also (for fun) adapted the Nayuki FFT code into Standard ML. You can find it here.

The original idea was to see how the performance of SML compiled to native code, and of SML compiled to Javascript using smltojs, compared with the previously-tested Javascript implementations and with any other SML versions I could find. (There’s FFT and DFT code from Rory McGuire here, probably written for clarity rather than speed, plus versions I haven’t tried in the SML/NJ and MLKit test libraries.)

I didn’t get as far as doing a real comparison, but I did note that it ran at more or less the same speed when compiled natively with MLton as the Javascript version does when run in Firefox, and that compiling to JS with smltojs produced much slower code. I haven’t checked where the overhead lies.

Standard ML and how I’m compiling it

I mentioned in an earlier post that I was starting to use Standard ML for a (modest) real project. An early problem I encountered was how to manage builds, when using third-party library modules and multiple files of my own code.

I’m not talking here about anything advanced; I don’t even care yet about incremental compilation or packaging source for other people to build. My requirements are:

  1. produce an efficient native binary
  2. but get error reports quickly, for a fast build cycle
  3. and make it possible to load my code into a REPL for interactive experiments
  4. while using a small amount of third-party library code.

Several implementations of Standard ML exist (I listed some in my last post) and they’re broadly compatible at the core language level, but they disagree on how to compile things.

There’s little consensus between implementations about how to describe module dependencies, pull in third-party libraries, or compile a program that consists of more than one file. The standard says nothing about how different source files interact. There’s no include or import directive and no link between filesystem and module naming. Some implementations do extend the standard (most notably SML/NJ adds a few things) but there isn’t any standard way for a program to detect what language features are available to it. Implementations differ in purpose as well: some SML systems are primarily interactive environments, others primarily compilers.

So here’s what I found myself doing. I hope it might be useful to someone. But before that, I hope that somebody will post a comment that suggests a better way and makes the whole post obsolete.

My general principle was to set up a “default build” that used the MLton compiler, because that has the best combination of standards compliance and fast generated binaries; but then hack up a script to make the build also possible with other compilers, because MLton is slow to run and provides no REPL so is less useful during development.

Let’s build that up from a simple program, starting with Hello World. (I’m using Poly/ML as the “other” compiler—I’ll explain why later on.)

What compilers expect

I’m going to use a function for my Hello World, rather than just calling print at the top level. It’ll make things clearer in a moment. Like any effectively no-argument function in SML, it takes a single argument of unit type, ().

(* hello.sml *)
 
fun hello () =
    print "Hello, world!\n"

val _ = hello ()

And to compile it:

$ mlton hello.sml
$ ./hello
Hello, world!
$

When MLton compiles a program, it produces an executable that will evaluate the ML source you provide. There is no “main” function in the ML itself; instead the program will evaluate the bindings that appear at top level—in this case the final call to hello.

Compiling with MLton is not fast. Hello World takes more than two seconds to build, which is beyond the “immediate” threshold you need for a good compile/edit feedback loop.

Poly/ML compiles much faster, but it doesn’t like this code:

$ polyc hello.sml
Hello, world!
Error-Value or constructor (main) has not been declared
Found near PolyML.export ("/tmp/polyobj.15826.o", main)
Static Errors
$

The limitations of the Standard in Standard ML are quickly reached! It doesn’t define any kind of entry point for a compiled program.

Most SML compilers—including the Poly/ML compiler, but not MLton—work by reading the program into the interactive environment and then dumping it out as object code. Anything evaluated at top-level in the program gets executed in the interactive environment while reading the code in, rather than being exposed as an entry point in the resulting executable. We can see this happening in our example, as Poly/ML prints out “Hello, world!” before the compile error.

Instead Poly/ML expects to have a separate function called main, which will be the entry point for the program:

(* hello.sml *)
 
fun hello () =
    print "Hello, world!\n"

fun main () = hello ()

That works…

$ polyc hello.sml
$ ./a.out
Hello, world!
$

… and with Poly/ML it only takes a small fraction of a second to compile. But now it won’t work with MLton:

$ mlton hello.sml
$ ./hello
$

The main function is never called. We’re going to need to do something different for the two compilers, using a two-file setup something like this:

(* hello.sml *)

fun hello () =
    print "Hello, world!\n"

fun main () = hello ()

(* main.sml *)

val _ = main ()

Then when using MLton, we compile both hello.sml and main.sml; when using Poly/ML, we compile only hello.sml. In both cases we will end up with a single native binary executable that calls the hello function when invoked.

Using Basis files to drive the compiler

So now we have two files instead of one. How do we compile two files?

MLton’s multi-file compilation is driven by what it calls Basis files. These have an extension of .mlb, and in their simplest form just consist of a list of .sml filenames which are evaluated one after another into a single environment:

(* hello.mlb *)
hello.sml
main.sml

This file format is specific to the MLton compiler; it’s not part of the SML language.

The above example isn’t quite enough for a complete build with MLton—whereas the compiler automatically introduces the standard Basis library into scope when building a single .sml file, it doesn’t do so when building from a .mlb description. So any functions we use from the standard Basis library (such as print) will be missing.

We need to add a line at the start to include the standard library:

(* hello.mlb *)
$(SML_LIB)/basis/basis.mlb
hello.sml
main.sml

and then

$ mlton hello.mlb 
$ ./hello 
Hello, world!
$

This is simple enough so far. It doesn’t work directly with Poly/ML, because the polyc compiler doesn’t support the .mlb format. But polyc is itself just a shell script that loads a .sml file into Poly/ML and exports the result. So we can make our own script that reads filenames from a simple .mlb file (omitting the line that requests the MLton-specific Basis library, as Poly/ML loads this automatically).

You can find such a script here. At the moment I’m saving this in the project directory as a file named polybuild, so:

$ ./polybuild hello.mlb
$ ./hello
Hello, world!
$
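The filename-reading step is small enough to sketch. What follows is a minimal illustration and not the actual polybuild script: the function name mlb_sources is made up, and it assumes a simple .mlb file with one filename per line and comments that fit on a single line. It keeps only the lines that name .sml files, which also drops the line requesting the Basis library.

```shell
#!/bin/sh
# Hypothetical helper, not the actual polybuild script.
# Print the .sml filenames listed in a simple .mlb file, one per line.
mlb_sources() {
    # Remove (* ... *) comments that sit on a single line, trim surrounding
    # whitespace, then keep only lines ending in .sml. Blank lines and
    # included .mlb files (such as the Basis library line) are dropped.
    sed -e 's/(\*.*\*)//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$1" |
        grep '\.sml$'
}
```

Handing the resulting names to Poly/ML and exporting an executable is left out of this sketch.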

The main.sml file gives no indication of where the main function it calls might live. The question of how to organise SML files together into an application is left completely outside the scope of the SML standard.

Now what if we introduce a dependency on a third-party library other than the standard one?

Third-party library dependencies

I was surprised to find no associative container type in the Standard ML Basis library—these containers have many applications and are often built in to languages nowadays. Let’s consider introducing one of them.

It happens that MLton (but not Poly/ML) ships with a port of the SML/NJ library which includes a couple of associative map containers. Here’s a little program that uses one:

(* dict.sml *)

structure Dict = SplayMapFn (struct
    type ord_key = string
    val compare = String.compare
    end)

val dict =
    let val d = Dict.empty
        val d = Dict.insert (d, "eggs", 5)
        val d = Dict.insert (d, "bacon", 2)
        val d = Dict.insert (d, "tomatoes", 3)
    in d
    end

fun main () =
    print ("Dictionary contains " ^
           (Int.toString (Dict.numItems dict)) ^ " items\n")

As before, the source code contains no indication of where SplayMapFn is to be found. (It’s a functor in the SML/NJ library. A functor is a function from structures to structures: in this case it takes a structure describing the key type and its ordering, and returns a map structure specialised for keys of type string, a bit like a template instantiation in C++. “Functor” is one of those terms that means something different in every language that uses it.)

Here’s a MLton-compatible .mlb file for this, in which we resolve the question of where to find our map type:

(* dict.mlb *)

$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/smlnj-lib/Util/smlnj-lib.mlb
dict.sml
main.sml

$ mlton dict.mlb
$ ./dict
Dictionary contains 3 items
$

This doesn’t work with our polybuild script, because dict.mlb includes another .mlb file and the script only handles .mlb files that list .sml files. We can’t fix this by just reading in the included .mlb file, because it isn’t just a list of source files (you’ll see this if you have a look at smlnj-lib.mlb): the MLton developers have crafted it carefully with a more advanced syntax to make sure no extraneous names are exported.

It’s still possible to concoct a Basis file for our program that refers only to other .sml files (and the Basis library). We just start by adding a reference to the pivotal file from the library that we are calling into, in this case splay-map-fn.sml, and then when we run the build and get an error, we add another file for each undefined symbol. The end result is this:

(* dict.mlb *)
$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/smlnj-lib/Util/lib-base-sig.sml
$(SML_LIB)/smlnj-lib/Util/lib-base.sml
$(SML_LIB)/smlnj-lib/Util/ord-key-sig.sml
$(SML_LIB)/smlnj-lib/Util/ord-map-sig.sml
$(SML_LIB)/smlnj-lib/Util/splaytree-sig.sml
$(SML_LIB)/smlnj-lib/Util/splaytree.sml
$(SML_LIB)/smlnj-lib/Util/splay-map-fn.sml
dict.sml
main.sml

I fear problems with bigger builds, but this does work with our polybuild script.
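For a script outside MLton to follow these paths, something has to replace $(SML_LIB) with a real directory. Here is a minimal sketch of that substitution, on the assumption that the caller supplies the directory. The function name expand_sml_lib is hypothetical, and the directory is assumed not to contain the characters |, & or backslash.

```shell
#!/bin/sh
# Hypothetical helper: rewrite $(SML_LIB) in a .mlb file to a real directory.
# $1 is the .mlb file, $2 is the directory to substitute (an assumption:
# the caller knows where the MLton library files are installed).
expand_sml_lib() {
    # The pattern is single-quoted so the shell does not treat $(...) as a
    # command substitution; \$ makes sed match a literal dollar sign.
    sed 's|\$(SML_LIB)|'"$2"'|g' "$1"
}
```

For example, expand_sml_lib dict.mlb /opt/example/sml would print the file with every $(SML_LIB) replaced by /opt/example/sml, where that path is only a placeholder.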

As a way to introduce a simple associative array into a program, this looks a bit crazy. Every modern language has a map or hash type either built in to the language, or available with minimal syntax in a standard library. This program “should be” a two-liner with a one-line compiler invocation.

I do like the feeling of building my program from nice uniform bricks. Should I want to replace a container implementation with a different one, it’s clear how I would do so and it wouldn’t involve changing the calling code or losing type safety—that does feel good.

But such overhead, for something as basic as this: we’ve just got to hope that it works out well in the end, as we compose structures and the scale gets bigger. Will it?

Interaction with the REPL

Along with the polybuild script, I made a script called polyrepl that starts the Poly/ML interactive environment (remember MLton doesn’t have one) and loads the contents of an .mlb file into it for further investigations. You can find that one in the same repo, here.
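One way to picture the loading step, as a rough sketch rather than the real polyrepl: turn each .sml filename in the .mlb file into a use command that can be evaluated at the Poly/ML prompt. The function name mlb_to_use is hypothetical, and the sketch assumes one filename per line, single-line comments, and plain filenames without quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper, not the actual polyrepl script.
# Print one Poly/ML `use "...";` command per .sml file named in a simple
# .mlb file. Lines that do not end in .sml (comments, blank lines, included
# .mlb files such as the Basis library) are skipped.
mlb_to_use() {
    sed -e 's/(\*.*\*)//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$1" |
        grep '\.sml$' |
        while IFS= read -r f; do
            printf 'use "%s";\n' "$f"
        done
}
```

Bear in mind that a file like main.sml, which calls main () at top level, would run the program as soon as it is loaded into the interactive environment.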

Notes on other compilers

(I summarised some of the Standard ML compilers available in my previous post.)

  • MLKit supports .mlb files directly, but it doesn’t understand the extended syntax used in the MLton smlnj-lib.mlb, and it doesn’t seem to like compiling files found in locations it doesn’t have write permission for (such as system directories).
  • Moscow ML doesn’t support .mlb files directly, and existing module code appears to be harder to use with it than with Poly/ML, because it seems to enforce structure/signature separation (signatures have to be separated out into .sig files, I think).
  • SML/NJ has its own compilation manager (called Compilation Manager) which seems rather complex, and I don’t really want to go there at the moment.