Fold: at the limit of comprehension

“Fold” is a programming concept, a common name for a particular higher-order function that is widely used in functional programming languages. It’s a fairly simple thing, but in practice I think of it as representing the outer limit of concepts a normal programmer can reasonably be expected to grasp in day-to-day work.

What is fold? Fold is an elementary function for situations where you need to keep a tally of things. If you have a list of numbers and you want to tally them up in some way, for example to add them together, fold will do that.

Fold is also good at transforming sequences of things, and it can be used to reverse a list or modify each element of a sequence.

Fold is a useful fundamental function, and it’s widely used. I like using it. I just scanned about 440,000 lines of code (my own and other people’s) in ML-family languages and found about 14,000 that either called or defined a fold function.

Let me try to describe fold more precisely in English: It acts upon some sort of iterable object or container. It takes another function as an argument, one that the caller provides, and it calls that function repeatedly, providing it with one of the elements of the container each time, in order, as well as some sort of accumulator value. That function is expected to return an updated version of the accumulator each time it’s called, and that updated version gets passed in to the next call. Having called that function for every element, fold then returns the final value of the accumulator.
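Written out as code, that description is actually shorter than the English. Here is a minimal sketch of a (left) fold in Python, a language that comes up again below; nothing is assumed beyond the language itself:

```python
def fold(f, acc, items):
    # Call f once per element, in order, passing the current accumulator
    # and the element; whatever f returns becomes the next accumulator.
    for x in items:
        acc = f(acc, x)
    return acc

print(fold(lambda acc, x: acc + x, 0, [1, 2, 3, 4, 5]))  # prints 15
```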

I tried, but I think that’s quite hard to follow. Examples are easier. Let’s add up a list of numbers in Standard ML, by folding with the “+” function and an accumulator that starts at zero.

> val numbers = [1,2,3,4,5];
val numbers = [1, 2, 3, 4, 5]: int list
> foldl (op+) 0 numbers;
val it = 15: int

What’s difficult about fold?

  1. Fold is conceptually tricky because it’s such a general higher-order function. It captures a simple procedure that is common to a lot of actions that we are used to thinking of as distinct. For example, it can be used to add up a list of numbers, reverse a list of strings, increase all of the numbers in a sequence, calculate a ranking score for the set of webpages containing a search term, etc. These aren’t things that we habitually think of as similar actions, other than that they happen to involve a list or set of something. In particular, we aren’t used to giving a name to the general procedure involved and then treating individual activities of that type as specialisations of it. This is often a problem with higher-order functions (and let’s not go into monads).
  2. Fold is syntactically tricky, and its function type is confusing because there is no obvious logic determining either the order of arguments given to fold or the order of arguments accepted by the function you pass to it. I must have written hundreds of calls to fold, but I still hesitate each time to recall which order the arguments go in. Not surprising, since the argument order for the callback function differs between different languages’ libraries: some take the accumulator first and value second, others the other way around.
  3. Fold has several different names (some languages and libraries call it reduce, or inject) and none of them suggests any common English word for any of the actions it is actually used for. I suppose that’s because of point 1: we don’t name the general procedure. Fold is perhaps a marginally worse name than reduce or inject, but it’s still probably the most common.
  4. There’s more than one version of fold. Verity Stob cheekily asks “Do you fold to left or to the right? Do not provide too much information.” Left and right fold differ in the order in which they iterate through the container, so they usually produce different results, but there can also be profound differences between them in terms of performance and computability, especially when using lazy evaluation. This means you probably do have to know which is which. (See footnote below.)
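Points 2 and 4 are easy to demonstrate concretely. A sketch in Python: the callback takes (accumulator, element) in the left fold but (element, accumulator) in the right fold, and with a non-associative operator such as subtraction the two directions group the applications differently and give different answers.

```python
def foldl(f, acc, xs):
    for x in xs:
        acc = f(acc, x)      # accumulator on the left
    return acc

def foldr(f, acc, xs):
    for x in reversed(xs):
        acc = f(x, acc)      # accumulator on the right
    return acc

# ((((0 - 1) - 2) - 3) - 4) = -10
print(foldl(lambda acc, x: acc - x, 0, [1, 2, 3, 4]))
# (1 - (2 - (3 - (4 - 0)))) = -2
print(foldr(lambda x, acc: x - acc, 0, [1, 2, 3, 4]))
```

This list-based foldr only illustrates the grouping; it sidesteps the laziness and computability differences mentioned in point 4, which arise when the sequence isn’t a finite in-memory list.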

A post about fold by James Hague a few years ago asked, “Is the difficulty many programmers have in grasping functional programming inherent in the basic concept of non-destructively operating on values, or is it in the popular abstractions that have been built-up to describe functional programming?” In this case I think it’s both. Fold is a good example of syntax failing us, and I think it’s also inherently a difficult abstraction to recognise (i.e. to spot the function application common to each activity). Fold is a fundamental operation in much of functional programming, but it doesn’t really feel like one because the abstraction is not comfortable. But besides that, many of the things fold is useful for are things that we would usually visualise in destructive terms: update the tally, push something onto the front of the list.

In Python the fold function (which Python calls reduce) was dropped from the built-in functions and moved into a separate module for Python 3. Guido van Rossum wrote, “apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what’s actually being fed into that function before I understand what the reduce() is supposed to do.” Instead the Python style for these activities usually involves destructively updating the accumulator.
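For illustration, here are the two styles side by side as a sketch, with nothing assumed beyond the standard library:

```python
from functools import reduce  # moved out of the builtins in Python 3

numbers = [1, 2, 3, 4, 5]

# The fold style: reduce with an explicit accumulator function.
total = reduce(lambda acc, x: acc + x, numbers, 0)

# The usual Python style: destructively update the accumulator.
tally = 0
for x in numbers:
    tally += x

print(total, tally)  # prints 15 15
```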

Functional programming will surely never be really mainstream so long as fold appears in basic tutorials for it. In practice, though, because it’s such a general function, it can often be usefully hidden behind a more discoverable domain-specific API.

***

(Footnote. You can tell whether an implementation of fold is a left or right fold by applying it to the list “cons” function, which is often called “::”. If this reverses a list passed to it, you have a left fold. For example, the language Yeti has a function simply called fold: which is it? —

> fold (flip (::)) [] [1,2,3,4];
[4,3,2,1] is list<number>

So it’s a left fold.)

 

Feedback on “Four MLs (and a Python)”

My last post got an unprecedented amount of attention after appearing on the popular site Hacker News. It spent about 14 hours on the front page there and got almost 11,000 unique visitors that day — a fair way above my usual daily average, for this whole blog, of about a hundred.

I combed through the comments on the Hacker News post, as well as entries on lobste.rs and Reddit. Counting only comments directly about the post (rather than to other commenters), I reckoned there were:

  • Nine comments responding only, or primarily, to performance differences in the language implementations;
  • Six suggestions to hoist regexp construction out of the split function in the OCaml version, to make it faster: four of them included code, and they were all quite polite;
  • Six objections that my Python code was un-idiomatic, un-Pythonic, or generally strange. Five of those proposed rewriting it to use the CSV parser module or pandas — good advice in general, though not really in line with the purpose of the post;
  • Six code rewrites in, enthusiasms about, or defences of, F♯;
  • Six comments suggesting other languages to look at (4 x Haskell; Scala; Mythryl);
  • Three observations that the Python code would run faster with PyPy;
  • Three people sharing my general feeling that it would be a good thing if Standard ML had more practical things going for it;
  • Two people having difficulty with my use of the term “Big Data” to describe a 300MB CSV file. (Sorry. I really only ever use the term ironically, but you weren’t to know that.)

I must say, OCaml developers seem like a keen bunch. Actually the vibe I get about both OCaml and F♯, from these few comments, is a happy one. About SML it’s more a sense of distant yearning. The Python comments were a bit grumpier, but that’s reasonable given that my post didn’t exactly engage with the way things actually get done in Python.

It’s quite obviously hard to resist a speed contest. I wasn’t initially going to include the runtime measurements in the post, but I couldn’t help myself either. Still, none of the coding decisions I made when writing the test programs had anything to do with speed. (Scalability, in a simplistic way, yes: I made sure not to read the whole file into memory, and to avoid non-tail-recursive code.) Then of course more people responded about performance than anything else. And even though I can see that that way lies madness, I’d still like to find time to follow up on the question of why the SML code didn’t go as quickly as the OCaml did.

Since I wrote that post, I’ve had a piece of real work come up for which these languages are quite well suited, involving parsing and manipulating a modest amount of symbolic data with a very specialised structure. I decided to do it using Standard ML this time, and hope to find out whether there’s a point at which it becomes more impractical than delightful.

Four MLs (and a Python)

I wrote a small command-line text processing program in four different ML-derived languages, to try to get a feel for how they compare in terms of syntax, library, and build-run cycles.

ML is a family of functional programming languages that have grown up during the past 40 years and more, with strong static typing, type inference, and eager evaluation. I tried out Standard ML, OCaml, Yeti, and F#, all compiling and running from a shell prompt on Linux.

The job was to write a utility that:

  • accepts the name of a CSV (comma-separated values) file as a command-line argument
  • reads all the lines from that file, each consisting of the same number of numeric columns
  • sums each column and prints out a single CSV line with the results
  • handles large inputs
  • fails if it finds a non-numeric column value or an inconsistent number of columns across lines (an uncaught exception is acceptable)

A toy exercise, but one that touches on file I/O, library support, string processing and numeric type conversion, error handling, and the build-invocation cycle.

I tested on a random Big Data CSV file that I had to hand; running the wc (word count) utility on it gives the size and a plausible lower bound for our program’s runtime:

$ time wc big-data.csv 
 337024 337024 315322496 big-data.csv

real 0m3.086s
user 0m3.050s
sys 0m0.037s
$

I’ve included timings throughout because I thought a couple of them were interesting, but they don’t tell us much except that none of the languages performed badly (with the slowest taking about 16 seconds on this file).

Finally I wrote the same thing in Python as well for comparison.

Practical disclaimer: If you actually have a CSV file you want to do things like this with, don’t use any of these languages. Do it with R instead, where this exercise takes three lines including file I/O. Or at least use an existing CSV-mangling library.

Here are the programs I ended up with, and my impressions.

Standard ML

Standard ML, or SML, is the oldest and “smallest” of the four and the only one to have a formal standard, fixed since 1997. Its standard library (the Basis library) is a more recent addition.

fun fold_stream f acc stream =
    case TextIO.inputLine stream of
	SOME line => fold_stream f (f (line, acc)) stream
      | NONE => acc
    
fun to_number str =
    case Real.fromString str of
	SOME r => r
      | NONE => raise Fail ("Invalid real: " ^ str)
		      
fun values_of_line line =
    let val fields = String.fields (fn c => c = #",") line in
	map to_number fields
    end

fun add_to [] values = values
  | add_to totals values =
    if length totals = length values then
	ListPair.map (Real.+) (totals, values)
    else raise Fail "Inconsistent-length rows"

fun sum_stream stream =
    fold_stream (fn (line, tot) => add_to tot (values_of_line line)) [] stream
		    
fun sum_and_print_file filename =
    let val stream = TextIO.openIn filename in
	let val result = sum_stream stream in
	    print ((String.concatWith "," (map Real.toString result)) ^ "\n")
	end;
	TextIO.closeIn stream
    end
							     
fun main () =
    case CommandLine.arguments () of [filename] => sum_and_print_file filename
      | _ => raise Fail "Exactly 1 filename must be given"

val () = main ()

(Note that although I haven’t included any type annotations, like all ML variants this is statically typed and the compiler enforces type consistency. There are no runtime type errors.)

This is the first SML program I’ve written in 23 years. I enjoyed writing it, even though it’s longer than I’d hoped. The Basis library doesn’t offer a whole lot, but it’s nicely structured and easy to understand. To my eye the syntax is fairly clear. I had some minor problems getting the syntax right first time—I kept itching to add end or semicolons in unnecessary places—but once written, it worked, and my first attempt was fine with very large input files.

I had fun messing around with a few different function compositions before settling on the one above, which takes the view that, since summing up a list is habitually expressed in functional languages as an application of fold, we could start with a function to apply a fold over the sequence of lines in a file.
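The same shape transliterates readily into Python, as a sketch (the names here are hypothetical, and the row-length consistency check from the SML version is omitted): reduce plays the part of fold_stream, folding a row-summing step over the lines.

```python
from functools import reduce
import io

def sum_lines(lines):
    # Fold a row-summing step over a sequence of CSV lines, starting
    # from an empty total (so the first row becomes the running total).
    def add_to(totals, line):
        values = [float(s) for s in line.split(',')]
        return values if not totals else [t + v for t, v in zip(totals, values)]
    return reduce(add_to, lines, [])

# An open file is itself an iterable of lines, so this works unchanged
# on one; a StringIO stands in for it here.
print(sum_lines(io.StringIO("1,2\n3,4\n")))  # prints [4.0, 6.0]
```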

More abstractly, there’s something delightful about writing a language with a small syntax that was fixed and standardised 18 years ago and that has more than one conforming implementation to choose from. C++ programmers (like me) have spent much of those 18 years worrying about which bits of which sprawling standards are available in which compiler. And let’s not talk about the lifespans of web development frameworks.

To build and run it I used the MLton native-code compiler:

$ time mlton -output sum-sml sum.sml

real 0m2.295s
user 0m2.160s
sys 0m0.103s
$ time ./sum-sml big-data.csv 
150.595368855,68.9467923856,[...]

real 0m16.383s
user 0m16.370s
sys 0m0.027s
$

The executable was a 336K native binary with dependencies on libm, libgmp, and libc. Although the compiler has a good reputation, this was (spoiler alert!) the slowest of these language examples both to build and to run. I also tried the PolyML compiler, with which it took less than a tenth of a second to compile but 26 seconds to run, and Moscow ML, which was also fast to compile but much slower to run.

OCaml

OCaml is a more recent language, from the same root but with a more freewheeling style. It seems to have more library support than SML and, almost certainly, more users. I started taking an interest in it recently because of its use in the Mirage OS unikernel project—but of these examples it’s the one in which I’m least confident in my ability to write idiomatic code.

(Edit: at least two commenters below have posted improved versions of this—thanks!)

open Str

let read_line chan =
  try Some (input_line chan)
  with End_of_file -> None

let rec fold_channel f acc chan =
  match read_line chan with
  | Some line -> fold_channel f (f line acc) chan
  | None -> acc

let values_of_line line =
  let fields = Str.split (Str.regexp ",") line in
  List.map float_of_string fields
  
let add_to totals values =
  match totals with
  | [] -> values
  | _  ->
     if List.length totals = List.length values then
       List.map2 (+.) totals values
     else failwith "Inconsistent-length rows"

let sum_channel chan =
  let folder line tot = add_to tot (values_of_line line) in
  fold_channel folder [] chan
	      
let sum_and_print_file filename =
  let chan = open_in filename in
  (let result = sum_channel chan in
   print_string ((String.concat "," (List.map string_of_float result)) ^ "\n");
   close_in chan)

let main () =
  match Sys.argv with
  | [| _; filename |] -> sum_and_print_file filename
  | _ -> failwith "Exactly 1 filename must be given"
    
let () = main ()

I’m in two minds about this code. I don’t much like the way it looks and reads. Syntax-wise there are an awful lot of lets; I prefer the way SML uses fun for top-level function declarations and saves let for scoped bindings. OCaml has a more extensive but scruffier library than SML and although there’s lots of documentation, I didn’t find it all that simple to navigate—as a result I’m not sure I’m using the most suitable tools here. There is probably a shorter simpler way. And my first attempt didn’t work for long files: caught out by the fact that input_line throws an exception at end of file (ugh), I broke tail-recursion optimisation by adding an exception handler.

On the other hand, writing this after the SML and Yeti versions, I found it very easy to write syntax that worked, even when I wasn’t quite clear in my head what the syntax was supposed to look like. (At one point I started to worry that the compiler wasn’t working, because it took no time to run and printed no errors.)

I didn’t spot at first that OCaml ships with separate bytecode and optimising native-code compilers, so my first tests seemed a bit slow. In fact it was very fast indeed:

$ time ocamlopt -o sum-ocaml str.cmxa sum.ml

real 0m0.073s
user 0m0.063s
sys 0m0.003s
$ time ./sum-ocaml big-data.csv 
150.595368855,68.9467923856,[...]

real 0m7.761s
user 0m7.740s
sys 0m0.027s
$

The OCaml native binary was 339K and depended only on libm, libdl, and libc.

Yeti

Yeti is an ML-derived language for the Java virtual machine. I’ve written about it a couple of times before.

valuesOfLine line =
    map number (strSplit "," line);
    
addTo totals row =
    if empty? totals then array row
    elif length totals == length row then array (map2 (+) totals row)
    else failWith "Inconsistent-length rows"
    fi;

rowsOfFile filename =
    readFile filename "UTF-8"
        do handle: map valuesOfLine (handle.lines ()) done;

sumFile filename =
    fold addTo (array []) (rowsOfFile filename);

sumAndPrintFile filename =
    println (strJoin "," (map string (sumFile filename)));

case (list _argv) of
     [filename]: sumAndPrintFile filename;
     _: failWith "Exactly 1 filename must be given";
esac

I love Yeti’s dismissive approach to function and binding declaration syntax—no let or fun keywords at all. Psychologically, this is great when you’re staring at an empty REPL prompt trying to decide where to start: no syntax to forget, the first thing you need to type is whatever it is that you want your function to produce.

The disadvantage of losing let and fun is that Yeti needs semicolons to separate bindings. It also makes for a visually rather irregular source file.

As OCaml is like a pragmatic SML, so Yeti seems like a pragmatic OCaml. It provides some useful tools for a task like this one. Although the language is eagerly evaluated, lazy lists have language support and are interchangeable with standard lists, so the standard library can expose the lines of a text file as a lazy list, making a fold over it very straightforward. The default map and map2 functions produce lazy lists.

Unfortunately, this nice feature then bit me on the bottom in my first draft, as the use of a lazy map2 in the addTo function blew the stack with large inputs (why? I’m not completely sure yet). The standard library has an eager map as well as a lazy one but lacks an eager map2, so I fixed this by converting the number row to an array (arguably the more natural type for it).

The Yeti compiler runs very quickly and compiles to Java .class files. With a small program like this, I would usually just invoke it and have the compiler build and run it in one go:

$ time yc ./sum.yeti big-data.csv 
150.59536885458684,68.9467923856445,[...]

real 0m14.440s
user 0m26.867s
sys 0m0.423s
$

Those timings are interesting, because this is the only example to use more than one processor—the JVM uses a second thread for garbage collection. So it consumed more total CPU time than the MLton binary, but finished sooner…

F♯

F♯ is an ML-style language developed at Microsoft and subsequently open-sourced, with a substantial level of integration with the .NET platform and libraries.

(Edit: as with the OCaml example, you’ll find suggestions for alternative ways to write this in the comments below.)

let addTo totals row =
    match totals with
    | [||] -> row
    | _ ->
       if Array.length totals = Array.length row then
         Array.map2 (+) totals row
       else failwith "Inconsistent-length rows"

let sumOfFields fields =
    let rows = Seq.map (Array.map float) fields in
    Seq.fold addTo [||] rows

let fieldsOfFile filename = 
    seq { use s = System.IO.File.OpenText(filename)
          while not s.EndOfStream do yield s.ReadLine().Split ',' }

let sumAndPrintFile filename =
    let result = fieldsOfFile filename |> sumOfFields in
    printfn "%s" (String.concat "," (Array.map string result))

[<EntryPoint>]
let main argv = 
    match argv with
    | [|filename|] -> (sumAndPrintFile filename; 0)
    | _ -> failwith "Exactly 1 filename must be given"

F♯ also has language support for lazy lists, but with different syntax (they’re called sequences) and providing a Python-style yield keyword to generate them via continuations. The sequence generator here came from one of the example tutorials.
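Since the comparison with Python’s yield comes up anyway, here is the rough Python equivalent of that seq block, as a sketch: a generator function that lazily yields each line’s comma-split fields.

```python
import io

def fields_of(handle):
    # Lazily yield the comma-split fields of each line, one line at a
    # time, much as the F# seq block above does with yield.
    for line in handle:
        yield line.rstrip('\n').split(',')

handle = io.StringIO("1,2\n3,4\n")
print(list(fields_of(handle)))  # prints [['1', '2'], ['3', '4']]
```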

A lot of real F♯ code looks like it’s mostly plugging together .NET calls, and there are a lot of capital letters going around, but the basic functional syntax is almost exactly OCaml. It’s interesting that the fundamental unit of text output seems to be the formatted print (printfn). I gather F♯ programmers are fond of their |> operator, so I threw in one of those.

I’m running Linux so I used the open source edition of the F♯ compiler:

$ time fsharpc -o sum-fs.exe sum.fs
F# Compiler for F# 3.1 (Open Source Edition)
Freely distributed under the Apache 2.0 Open Source License

real 0m2.115s
user 0m2.037s
sys 0m0.063s
$ time ./sum-fs.exe big-data.csv 
150.595368854587,68.9467923856445,[...]

real 0m13.944s
user 0m13.863s
sys 0m0.070s
$

The compiler produced a mere 7680-byte .NET assembly, which of course (like Yeti) requires a substantial managed runtime. Performance seems pretty good.

Python

Python is not an ML-like language; I include it just for comparison.

import sys

def add_to(totals, values):
    n = len(totals)
    if n == 0:
        return values
    elif n == len(values):
        return [totals[i] + values[i] for i in range(n)]
    else:
        raise RuntimeError("Inconsistent-length rows")
        
def add_line_to(totals, line):
    values = [float(s) for s in line.strip().split(',')]
    return add_to(totals, values)

def sum_file(filename):
    f = open(filename, 'r')
    totals = []
    for line in f:
        totals = add_line_to(totals, line)
    f.close()
    return totals

if __name__ == '__main__':
    if len(sys.argv) != 2:
        raise RuntimeError("Exactly 1 filename must be given")
    result = sum_file(sys.argv[1])
    print(','.join([str(v) for v in result]))

Feels odd having to use the return keyword again, after using languages in which one just leaves the result at the end of the function.

This is compact and readable. A big difference from the above languages is invisible—it’s dynamically typed, without any compile-time type checking.

To build and run this, I just invoked Python on it:

$ time python ./sum.py ./big-data.csv 
150.59536885458684,68.9467923856445,[...]

real 0m10.939s
user 0m10.853s
sys 0m0.060s
$

That’s Python 3. Python 2 was about a second faster. I was quite impressed by this result, having expected to suffer from my decision to always return new lists of totals rather than updating the values in them.

Any conclusions?

Well, it was a fun exercise. Although I’ve written more in these languages than appears here, and read quite a bit about all of them, I’m still pretty ignorant about the library possibilities for most of them, as well as about the object support in OCaml and F♯.

I am naively impressed by the OCaml compiler. For language “feel”, it gave me the least favourable first impression but I can imagine it being pleasant to use daily.

F♯ on Linux proved unexpectedly straightforward (and fast). Could be a nice choice for web and server applications.

I have made small web and server applications using Yeti and enjoyed the experience. Being able to integrate with existing Java code is good, though of course doubly so when the available libraries in the language itself are so limited.

Standard ML has a clarity and simplicity I really like, and I’d still love to try to use it for something serious. It’s just, well, nobody else seems to—I bet quite a lot of people have learned the language as undergrads (as I did) but it doesn’t seem to be a popular choice afterwards. Hardly anyone uses Yeti either, but the Java interoperability means you aren’t so dependent on other developers.

Practically speaking, for jobs like this, and where you want to run something yourself or give someone the source, there’s not much here to recommend anything other than Python. Of course I do appreciate both compile-time typechecking and (for some problems) a more functional style, which is why I’m writing this at all.

But the fact that compilers for both SML and OCaml can generate compact and quick native binaries is interesting, and Yeti and F♯ are notable for their engagement with other existing frameworks.

If you’ve any thoughts or suggestions, do leave a comment.

Conversing with the author of the Yeti programming language

I’m a great fan of the Yeti programming language, a JVM-based functional language which I think offers an irresistible combination of easy, fluent syntax with rigorous typechecking and good interoperability and performance.

Yeti is written by one developer, Madis Janson. I dropped him a note with a few questions about the origin of and his aims for the language, and he gave permission to post the conversation here (thanks!).

Chris: I’m interested in how you came to write Yeti…

Madis: The biggest reason was just having an idea one evening about basic syntax, which kind of seemed to fit.

C.: Which syntax was that?

M.: From October 2007…

foo = expr;              // binding - define foo
foo arg1 args... = expr; // binding - define function foo
{ binding; bindings }    // structure literal
[ expr; expr... ]        // list
(expr or binding; expr)  // sequence
foo.bar                  // structure field ref
foo bar                  // application
+,-,*,/                  // arithmetic
// module is simply constant value
// data types - primitives, list, structure, function

Basically only thing that has changed in this core language is ‘,’ instead ‘;’ in list and structure literals (and this was a very early change).

I had previously thought quite a lot about type systems and inference, so I had ready ideas to try about that too. So I just decided to try to write it and look what comes out.

C.: Did you have a particular field of use for it in mind?

M.: At the time I was mostly working on one enterprise Java system, probably Yeti is somewhat influenced by the problems I had at hand there. So Java web applications are probably the nearest to the particular field. At the same time it’s heavily influenced by perl text-manipulation capabilities.

Some influence was probably Scala, in which I was a bit disappointed. It can be said that Scala type system is genius (in mixing Java object hierarchy with expressive functional typesystem). The down side is that the complexity is somewhat overwhelming and the core language/library provides about ten roughly equivalent ways for doing anything (exaggeration, but still it feels like this). Scala code tends to be clever and compact, but not particularly easy to read (or even write). In the end I think, that trying to meld multiple radically different typesystem realm in this way is not worth the resulting complexity and confusion. Yeti took another approach in trying to isolate the Java interfacing – there are two typesystem subsets, “Yeti types” and “Java types”, and while working with Yeti types the Java typesystem can be ignored.

In the end basing Yeti on JVM at all has been blessing and curse at once, but it’s too late to change it.

C.: What aspects of the language itself (as opposed to implementation details) are influenced by limitations of the JVM?

M.: Yeti does tail call optimisation only inside single function (using JVM goto instruction) since JVM doesn’t support it. Type classes could have been implemented on JVM, but when I started writing Yeti, it seemed too complicated.

C.: I like the approach you took with the typesystem and the small number of core types in Yeti. One intriguing thing is the fact that there’s only one number type. Is that a question of necessity because type inference would be too fiddly otherwise, or a design decision?

M.: Two reasons. It allowed to avoid type classes altogether, and any complexity associated with implementing them. Multiple number types require either type classes for overloading, or separate operators for each numeric type. OCaml does the latter, which is kind of ugly imho.

Second reason is, that even with type classes and overloading, the strictness of HM style typesystem leads to requiring to use functions like float_of_int/fromEnum/toIntegral whenever the value is of “wrong” type. This usually won’t make the code any better and just clutters it. Since Yeti goal has been more simplicity and getting out-of-the way than being fastest language on the earth, it made sense to treat any kind of number as being just a number. The numeric tower system has been copied from Scheme.

Yeti tries to show, that you actually can have scripting-language style simplicity together with static typing. I think it has mostly succeeded in this, although there have been problems (mostly with error messages). You can prototype quite like in perl or python, and then just write type declarations wherever they seem useful for reading clarity. Differing with optional static typing, the code without declarations is still checked for being correctly typed – this is important if you want to rely on static typing with larger code base.

It’s not enough to have just experimental language to prove that this approach works – the language itself must be complete and stable enough that lot of complex code can and will be written in it, and experience gained through it.

That’s one of the current goals for maintaining Yeti, having a working proof that the distinction between dynamic and static languages is meaningless, that you can have the cake and eat it too (have most of the type-related errors discovered early with very little extra effort by developer, while retaining most of the relative simplicity of dynamic typing).

Yertle: an RDF/Turtle library in Yeti

I wrote a little while back about the Yeti programming language and the most pleasant time I was having with it, learning and re-learning some functional programming idioms.

One of the first things I set about writing—because I could think of a use for it and it seemed like a nice thing to make—was a parser for the RDF/Turtle language. I’m afraid I couldn’t resist calling my Yeti Turtle module “Yertle”, after the Dr Seuss turtle. You can find the code here on Bitbucket. It is also an in-memory datastore and a Turtle serialiser, but the parser is the complicated bit.

How I got on

I don’t recall ever having written a non-trivial parser before, so this was an exercise in learning how to do that, as well as practice in Yeti. One reason I thought a Turtle parser would be a good exercise is that it’s a simple, human-readable language that I (thought I) already knew.

In fact it’s not as simple as all that: the Turtle grammar is very straightforward, but it’s lexically disappointingly fiddly. And while I had thought it could be processed with only single-character lookahead and no backtracking, this turns out not to be the case.

The result is a recursive-descent parser without a preceding tokenizer, in which the parse functions return either partially constructed triples (complete triples being emitted directly into a state object) or an error variant.

Unwinding errors through parse function return values rather than, say, exceptions seems nice and tidy, but it does lead to a lot of nested decisions:

aOrPrefixedName state =
    case pnameCandidate state of
    OK tok:
        case tok of
        Token "a": OK rdfTypeIRI;
        Token t: prefixExpanded state t;
        esac;
    Error e:
        Error e;
    esac;

However, with this structure we can add a composition operator to apply a chain of parse functions, dropping out when we get one that returns an error:

(~>) a b =
    case a of
    OK result: b result;
    Error e: Error e;
    esac;

making our example

aOrPrefixedName state =
    pnameCandidate state ~>
       \case of
        Token "a": OK rdfTypeIRI;
        Token t: prefixExpanded state t;
        esac;

in which the second part of aOrPrefixedName (the case lambda expression) is evaluated only if the first part (pnameCandidate) succeeds and returns an OK (Token t) variant; otherwise it is short-circuited and the error returned immediately.

This seems to be a pretty basic construct in monadic parsers and the like—being clueless about parsers and even more clueless about monads, I was unaware of this, but this kind of thing drops out by itself when looking at how to simplify a potentially repetitive bit of code.
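For readers who, like me, hadn’t met this construct before: the ~> operator is essentially “bind” for a result type. Here is a minimal sketch of the same idea in Python (purely my own illustration; the Ok/Err classes and the two parse steps are hypothetical stand-ins, not anything from Yertle):

```python
# Sketch of short-circuiting composition over a result type.
# Ok/Err and the parse steps are invented for illustration.

from dataclasses import dataclass

@dataclass
class Ok:
    value: object

@dataclass
class Err:
    message: str

def then(result, next_step):
    """Apply next_step to the value inside an Ok; pass an Err through."""
    if isinstance(result, Ok):
        return next_step(result.value)
    return result  # an Err short-circuits the rest of the chain

# A toy chain: parse an integer, then require it to be positive.
def parse_int(text):
    try:
        return Ok(int(text))
    except ValueError:
        return Err(f"not an integer: {text!r}")

def require_positive(n):
    return Ok(n) if n > 0 else Err(f"not positive: {n}")

result = then(parse_int("42"), require_positive)   # Ok(42)
failed = then(parse_int("oops"), require_positive) # the Err from parse_int
```

The point, as in the Yeti version, is that require_positive is never called once an earlier step has failed; the first Err simply falls through to the caller.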

How well does it work?

The parser passes all the W3C spec tests, plus a few additional unit tests.

Yertle has about half the number of lines of code of David Robillard’s Serd, the cleanest C implementation I know. But Yertle uses quite a number of regular expressions, and I’m not sure I would argue that it’s more readable than Serd. On the other hand, it would certainly be even less readable if I had written it in C.

Yertle is fast enough to handle modest documents in many practical situations, but it isn’t all that fast. It currently takes about four times as long to parse and re-serialise a large Turtle file as rapper does and around ten times as long as the Serd serdi conversion utility. The times aren’t quite comparable because rapper and serdi don’t load to an indexed store on the way through (in fact serdi can convert arbitrarily large documents, which neither rapper nor Yertle can). And neither of those utilities, in the versions I compared with, quite implements the current W3C Turtle spec. But I don’t imagine those things make a huge difference.

(I decided beforehand I’d be happy to be within a factor of 5 of current C parser/serialisers, by which I really meant rapper. But as it turns out, now it’s written, I’d like to be a bit closer than that! I may have a bit of work to do, if only for my own satisfaction.)

Functional programming and the joy of learning something again

Twenty years ago, as a maths-and-computing undergraduate at the University of Bath, I was introduced to functional programming using the ML language by the excellent Julian Padget. We undergrads were set the traditional assignment of writing a sed-like text processor in ML, and found it first baffling and then, if we were lucky, rather exciting. I got all enthusiastic about functional languages and, like any proper enthusiast, spent a delightful pointless while trying to design a toy language of my own.

Then I went off and got a job as a C++ programmer.

C++ is a practical, useful language, but it’s also complex, verbose, and baroque, with many different schools of practice. I’ve done a fair bit of work in other languages as well (equally boring ones, like Java) but effectively, I’ve spent much of the past two decades simply continuing to learn C++. That’s a bit crazy.

So when messing about with Android recently, I decided I wanted to try to get some of that old sense of joy back. I went looking for a language with the following properties:

  • It should be primarily functional rather than object-oriented
  • It should be strongly typed, ideally with Hindley-Milner typing (the most exciting thing about ML, for the undergraduate me)
  • It should have a compiler to JVM bytecode, so I could use it directly in Android apps, and should be able to use Java library code
  • It should have a REPL, an interactive evaluation prompt for productive messing around
  • It should be nice to read—it should be obviously a language I wanted to learn, and I was going to be happily guided by personal whim
  • It must be simple enough for the old, stupid me to have a chance of getting to grips with it
  • And, while I wasn’t going to care very much how mainstream it was, it did need to be reasonably complete and reliable.

There are lots of languages out there for the JVM, including quite a few functional ones. Scala and Clojure are the best-known.

Scala (here in Wikipedia) is a multi-paradigm language that, for me, has shades of C++ in that it feels like it’s designed to improve on all sorts of things in Java rather than be something simple of its own. It also looks object-oriented first and functional second; doesn’t prioritise interactive evaluation; and although it has a sophisticated type system, it doesn’t do inference on function parameter types. All in all, it just seemed a bit chunky to me.

Clojure (here in Wikipedia) looks more fun. It has a very vibrant community and seems well-loved. It’s basically a Lisp for the JVM, and I’ve written Lisp before. That’s definitely interactive and functional. But I wasn’t really setting out to find Lisp again.

Yeti

Having sifted through a few other possibilities, the one that really seemed to fit the bill was Yeti.

Yeti is a functional language with Hindley-Milner type inference, for the JVM, with a relatively simple syntax and interoperability with Java, that has an interactive REPL up front. (See the snappy tutorial.) It seems to be basically the work of one programmer, but a programmer with taste.

The syntax of Yeti looks a bit like the way I remember ML—although on reviewing ML, it turned out not to be as similar as I’d thought. Functions are defined and applied with just about the simplest possible syntax, and the language deduces the types of all values except Java objects. The lack of commas in function application syntax makes it obvious how to do partial application, a straightforward way to obtain closures (functions with context).

Here’s a trivial function, an application of it, and a partial application. The lines starting > are my typing, and the others are type deductions returned by the evaluation environment. Comments start //, as in Java.

> f a b = a + b   // f is a function taking two arguments
f is number -> number -> number = <code$f>
> f 2 3           // apply f to two numbers
5 is number
> f 2             // apply f to one number, returning a new function
<yeti.lang.Fun2_> is number -> number
> g = f 2         // assign that function to g
g is number -> number = <yeti.lang.Fun2_>
> g 3             // apply g to the second number
5 is number
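For comparison only (my own illustration, nothing Yeti-specific): the same partial application can be mimicked in most languages, but it typically has to be asked for explicitly rather than falling out of the call syntax. In Python, for instance:

```python
# The Yeti session above, transliterated: partial application must be
# requested explicitly via functools.partial (or a closure).

from functools import partial

def f(a, b):
    return a + b

g = partial(f, 2)   # fix the first argument, like Yeti's `f 2`

print(f(2, 3))  # 5
print(g(3))     # 5 -- g is now a one-argument function
```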

So far, so academic toy language. But the more I apply Yeti to practical problems, the more I find it does work as a practical language.

What is challenging, of course, is that every language or language family has its set of idioms for handling everyday problems, and on the whole I simply don’t know those idioms yet in a functional language. This is the first time I’ve really tried to do anything serious with one. I know the language, roughly, but I don’t really speak the language. I’m still leafing through the phrasebook. My hovercraft is still full of eels.

With most of my incredibly stupid questions on the Yeti mailing list—which get very patient responses, but I really do need to cut back on the late-night stupidity and start reviewing my code in the morning instead—the answer turns out to be, “it’s simpler than I thought”. And I’m really enjoying it.

Why type inference?

A footnote. Why did I want a language with type inference?

Well, I’m lazy of course, and one of the most obvious sources of tedium in C++ and Java is having to type everything out over and over again.

And I’m a bit nostalgic about those undergrad days, no doubt.

But also, I’m increasingly mistrustful of my own work. In languages such as Python and Objective-C the concept of duck typing is widely used. This essentially means employing objects on the basis of their supported methods rather than their nominal class (“if it walks like a duck…”). This relaxed approach reaches a bit of a pinnacle in Ruby on Rails, which I’ve been working with a bit recently—and I find the magic and the assumptions go a bit far for me. I like to have some of the reassurance of type checking.

So, type inference gives you—in theory—the best of both worlds. You get to write your code using duck-typing principles, and the compiler proof-reads for you and checks that your types really do work out.
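Python itself has since grown a loose analogue of this, for what it’s worth (my own comparison, not something from the ML world): typing.Protocol lets a static checker such as mypy verify duck-typed code structurally, by the shape of its methods rather than its nominal class. The class names here are invented for illustration:

```python
# Structural ("duck") typing checked statically: hail() accepts anything
# whose shape matches the Quacks protocol, regardless of inheritance.

from typing import Protocol, runtime_checkable

@runtime_checkable
class Quacks(Protocol):
    def quack(self) -> str: ...

class Duck:
    def quack(self) -> str:
        return "quack"

class Robot:  # unrelated to Duck by inheritance
    def quack(self) -> str:
        return "beep-quack"

def hail(d: Quacks) -> str:
    # mypy accepts any argument whose methods match Quacks
    return d.quack()

print(hail(Duck()), hail(Robot()))
```

It is checking rather than inference, of course: the annotations still have to be written out, which is exactly the tedium Hindley-Milner removes.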

That’s the theory. Does it scale? Not sure. And if it were so great, wouldn’t it have become more mainstream during the past 30 years? Some moderately widely-used languages, like Haskell, use it, but they’re still only moderately widely-used. So, we’ll see.

There are some obvious immediate disadvantages to type inference. Long error messages, for a start.

And as a C++ guy I somewhat miss function overloading and even (sometimes) operator overloading. A function argument can take more than one type, of course—that’s the whole point—but only if the types can be unified in terms of their properties; you can’t just reuse a function name for a second function that has a similar effect but works on unrelated types.

Most of the situations in which I want function overloading can be handled instead using variant types or module namespaces, both of which work well in Yeti, but sometimes it seems to come down to inventing slightly more awkward function names than I’d really like.