Fold: at the limit of comprehension

“Fold” is a programming concept, a common name for a particular higher-order function that is widely used in functional programming languages. It’s a fairly simple thing, but in practice I think of it as representing the outer limit of concepts a normal programmer can reasonably be expected to grasp in day-to-day work.

What is fold? Fold is an elementary function for situations where you need to keep a tally of things. If you have a list of numbers and you want to tally them up in some way, for example to add them together, fold will do that.

Fold is also good at transforming sequences of things, and it can be used to reverse a list or modify each element of a sequence.

Fold is a useful fundamental function, and it’s widely used. I like using it. I just scanned about 440,000 lines of code (my own and other people’s) in ML-family languages and found about 14,000 that either called or defined a fold function.

Let me try to describe fold more precisely in English: It acts upon some sort of iterable object or container. It takes another function as an argument, one that the caller provides, and it calls that function repeatedly, providing it with one of the elements of the container each time, in order, as well as some sort of accumulator value. That function is expected to return an updated version of the accumulator each time it’s called, and that updated version gets passed in to the next call. Having called that function for every element, fold then returns the final value of the accumulator.

I tried, but I think that’s quite hard to follow. Examples are easier. Let’s add a list of numbers in Standard ML, by folding with the “+” function and an accumulator that starts at zero.

> val numbers = [1,2,3,4,5];
val numbers = [1, 2, 3, 4, 5]: int list
> foldl (op+) 0 numbers;
val it = 15: int
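It may also help to see how fold is put together. Here is a minimal sketch of how a left fold could be defined in Standard ML (the usual textbook recursion, not the actual Basis Library source):

fun foldl f acc [] = acc
  | foldl f acc (x::xs) = foldl f (f (x, acc)) xs

The accumulator acc is threaded through the recursion: each element x is combined with it by the caller’s function f, and the result becomes the accumulator for the rest of the list. When the list runs out, the accumulator is the answer.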

What’s difficult about fold?

  1. Fold is conceptually tricky because it’s such a general higher-order function. It captures a simple procedure that is common to a lot of actions that we are used to thinking of as distinct. For example, it can be used to add up a list of numbers, reverse a list of strings, increase all of the numbers in a sequence, calculate a ranking score for the set of webpages containing a search term, etc. These aren’t things that we habitually think of as similar actions, other than that they happen to involve a list or set of something. In particular, we aren’t used to giving a name to the general procedure involved and then treating individual activities of that type as specialisations of it. This is often a problem with higher-order functions (and let’s not go into monads).
  2. Fold is syntactically tricky, and its function type is confusing because there is no obvious logic determining either the order of arguments given to fold or the order of arguments accepted by the function you pass to it (see the sketch after this list). I must have written hundreds of calls to fold, but I still hesitate every time over which order the arguments go in. Not surprising, since the argument order for the callback function differs between languages’ libraries: some take the accumulator first and the value second, others the other way around.
  3. Fold has several different names (some languages and libraries call it reduce, or inject) and none of them suggests any common English word for any of the actions it is actually used for. I suppose that’s because of point 1: we don’t name the general procedure. Fold is perhaps a marginally worse name than reduce or inject, but it’s still probably the most common.
  4. There’s more than one version of fold. Verity Stob cheekily asks “Do you fold to left or to the right? Do not provide too much information.” Left and right fold differ in the order in which they iterate through the container, so they usually produce different results, but there can also be profound differences between them in terms of performance and computability, especially when using lazy evaluation. This means you probably do have to know which is which. (See footnote below.)
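To make points 1 and 2 concrete, here is a small sketch in Standard ML. The type below is that of the Basis Library’s foldl; the example calls are my own:

foldl : ('a * 'b -> 'b) -> 'b -> 'a list -> 'b

SML’s foldl takes the callback first, then the initial accumulator, then the list, and its callback receives a (value, accumulator) pair; Haskell’s foldl, by contrast, hands its callback the accumulator first. The same shape of call covers quite different-looking jobs:

> foldl (op ::) [] ["a","b","c"];
val it = ["c", "b", "a"]: string list

> foldr (fn (x, acc) => (x + 1) :: acc) [] [1,2,3];
val it = [2, 3, 4]: int list

The first call reverses a list; the second increases every element of one. Nothing varies between these and the earlier summing example except the choice of fold, the callback, and the starting accumulator.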

A post about fold by James Hague a few years ago asked, “Is the difficulty many programmers have in grasping functional programming inherent in the basic concept of non-destructively operating on values, or is it in the popular abstractions that have been built up to describe functional programming?” In this case I think it’s both. Fold is a good example of syntax failing us, and I think it’s also inherently a difficult abstraction to recognise (i.e. to spot the function application common to each activity). Fold is a fundamental operation in much of functional programming, but it doesn’t really feel like one because the abstraction is not comfortable. But besides that, many of the things fold is useful for are things that we would usually visualise in destructive terms: update the tally, push something onto the front of the list.

In Python, the fold function (which Python calls reduce) was dropped from the built-in functions and moved into a separate module for Python 3. Guido van Rossum wrote, “apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what’s actually being fed into that function before I understand what the reduce() is supposed to do.” Instead, the Python style for these activities usually involves a loop that destructively updates the accumulator.

Functional programming will surely never be really mainstream so long as fold appears in basic tutorials for it. In practice, though, because it’s such a general function, it can often usefully be hidden behind a more discoverable domain-specific API.

***

(Footnote. You can tell whether an implementation of fold is a left or right fold by calling it with the list “cons” function, which is often called “::”, as the callback and an empty list as the accumulator. If this reverses the list passed to it, you have a left fold. For example, the language Yeti has a function simply called fold; which is it? —

> fold (flip (::)) [] [1,2,3,4];
[4,3,2,1] is list<number>

So it’s a left fold.)
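For comparison, the same test in Standard ML needs no flip, because SML’s foldl already passes the element first and the accumulator second:

> foldl (op ::) [] [1,2,3,4];
val it = [4, 3, 2, 1]: int list

> foldr (op ::) [] [1,2,3,4];
val it = [1, 2, 3, 4]: int list

The right fold hands back the list unchanged, because it combines elements starting from the far end.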


Zero-based indexing

The excellent Greg Wilson, founder of Software Carpentry, tweeted a link the other day to a 2013 blog post by Mike Hoye.

I didn’t comment on this article when it first appeared because I didn’t have the nerve to confront its author, who was shouting down everyone who tried to discuss it in the comments. But I can’t bear to see this article promoted again, and by a good authority.

The article claims that the reason most of today’s programming languages use zero-based indexing (i.e. they count array indexes from 0, so that arr[0] is the first element of array arr, rather than arr[1]) is that it saved a tiny amount of compile time (not run time), and that this mattered because on a specific IBM mainframe hosted by MIT in the 70s there was a danger that a job taking too long to compile might be bumped in order to make way for a program to calculate handicap points for yacht racing.

This is a pretty implausible suggestion, so it needs some pretty good evidence. That isn’t there. The article has some very nice sources, but the quotes from them just don’t support the proposition they’re being asked to support. The main quotes, from Martin Richards and Tom Van Vleck, both appear to say nearly the opposite of the things they’re described as saying. There’s plenty of room for nuance in interpreting what people say, but the author accepts no nuance in anyone’s responses to the article, choosing instead to mock and ridicule anyone who doesn’t agree with him. There’s no citation for the one thing that is necessary to make the argument hold together (that indexes were calculated at compile time rather than run time). Reading this article carefully, the only conclusion I can draw is that the choice of 0-based indexing almost certainly has nothing to do with yachts.

I don’t mind a great deal whether a programming language uses 0-based or 1-based indexing. The reason this matters to me is that the article is not just a screed with a funny story in it, but a call for rigour in understanding the history of programming languages, something I do care about and that its author appears to take very seriously indeed. Its general principle is really sound — we get used to a lot of arbitrary aspects of languages and then explain them as the mythology of the elders, rather than finding the actual reasons. But this article only added to the mythology, and people who know better are now citing it as if it had been established to be true, which it almost certainly isn’t.

(I feel really bad just writing this. It’s quite possible the author is regretting ever getting involved in this stupid topic but has too much integrity to take down or edit the post. I wish I had never been reminded of how maddening I found it.)


Bowie

Here’s a playlist of good David Bowie songs that I had never heard until after he died last week.

Spotify playlist
YouTube links:
Dead Against It (1993)
Up The Hill Backwards (1980)
Move On (1979)
Dancing With The Big Boys (1984) (with Iggy Pop)
I Would Be Your Slave (2002)
Girl Loves Me (2016)
You’ve Been Around [Dangers 12″] (1993) (Jack Dangers remix)
Nite Flights (1993) (Scott Walker cover)
No Control (1995)
Bring Me The Disco King (2003)
I’m Deranged (1995)
5:15 The Angels Have Gone (2002)

Most of these came out after the peak of his popularity, but they aren’t obscure at all — I was just never a fan.

The first Bowie songs I remember hearing on the radio were Modern Love and Let’s Dance, both released in 1983 when I was eleven. I thought those two were fine, though they weren’t the sort of thing I liked to think I was into at the time. (I had a big Motörhead war-pig patch on my little denim jacket. Lemmy’s death was also the end of an era.)

A few years later, a cousin introduced me to some of the Spiders from Mars period songs like Rebel Rebel and Suffragette City. I was a bit puzzled because I thought I knew Bowie as a smooth, modern 80s-sounding chap. But I didn’t get the appeal either: too much like everything else from the early 70s. Rebel Rebel even sounded like the Stones, which was definitely my dad’s music.

Back in the real timeline of the 80s, Bowie was releasing Never Let Me Down, an album seen everywhere (one of several awful record covers) but seldom played, then launching the drearily adult Tin Machine.

His next album, Black Tie White Noise, didn’t come out until 1993, when I was briefly in Berlin as a student and mostly listening to industrial music and obscure things I read about in Usenet groups. If I had been aware that David Bowie had an album out, I would certainly have ignored it. By the time of 1997’s Earthling, a jungle-influenced album released a whole two years after peak jungle with a dreadful Cool Britannia cover, it felt socially impossible to admit to liking a Bowie song ever again. And that was pretty much the end of that.

There’s been a David Bowie album, collaboration, tour, or retrospective for almost every year of my life, and I’ve never taken more than a passing interest in any of them.

I was taken by surprise, then, by how emotional I felt about his death.

***

What did eventually make me notice David Bowie as a figure was the connection with Iggy Pop. I think Iggy is brilliant, and I’d been a bit of a fan for a while before I eventually twigged what it was that his most interesting stuff had in common. That made me aware of the famously dramatic and productive spell those two had in Berlin in the late 70s (the only albums of Bowie’s that I ever actually bought are from this period), and it opened up a whole web of interesting collaborations and influences.

(Going back during the past week and filling in a lot of the songs of Bowie’s that I’ve missed during the last few decades, it’s been particularly fun to hear Iggy Pop numbers, er, pop up all over the place. China Girl — always an Iggy song to me — is well known, but there are at least three other albums that recycle songs previously recorded by him, including a straight cover of the flop lead single from Iggy’s most foolish album. A sustained friendship.)

***

So something of the emotion for me has to do with all that Berlin stuff. There are two aspects to that. One is the grubbily romantic idea of “pressure-cooker” West Berlin, seen from a distance as a place of hideouts, drugs, spying, longing, separation, and any other melodrama that “we” could project onto it. I’m sure this version of the city was overstated for lyrical purposes, but it probably did exist to a degree. The Berlin that fascinated and frightened me in 1993 was already a very different city, and both versions are hardly visible in today’s shiny metropolis.

The other aspect is the notion that moving to a different town in a different country could give you a new life and make your past disappear, even for someone already so celebrated — that it could really be so simple. What makes that idea available here is that Bowie didn’t just go, but then produced such different work after going that it really could appear as if his past had not gone with him.

This impression of self-effacement alongside all the self-promotion, the ability to erase the past, is a very attractive one for a pop star, and it also fits with the amount of collaborative work Bowie did. From some of the videos you can imagine that he was never happier than when playing keyboards or doing tour production for Iggy, singing backing vocals in a one-off with Pink Floyd, or playing second fiddle to Freddie Mercury or Mick Jagger.

Perl 6

I see the official release of the Perl 6 language specification happened on Christmas day.

The first piece of commercial web development I did was in Perl 5. A lot of people can probably say the same thing. This one was a content-management system, a project led by James Elson in 1999 at PSWeb Ltd, a small agency in Farringdon that renamed itself Citria and expanded rapidly during 1999-2001 before deflating even more rapidly when the dotcom bust arrived.

My recollection was that this particular CMS only ever had one customer, the BBC, who used it only for their very small Digital Radio site. But I still have a copy of the code and on inspection it turns out to have some comments that must have been added during a later project, so perhaps it did get deployed elsewhere. It was a neat, unambitious system (that’s a good thing, James is a tasteful guy) that presented a dynamic inline-editing blocks-based admin interface on a backend URL while generating static pages at the front end.

I remember there was an open question, for a time, about whether the company should pursue a product strategy and make this first CMS, or something like it, the basis of its business, or else take up a project strategy and use whatever technology from whichever provider seemed most appropriate for each client. The latter approach won out. It’s interesting to speculate about the other option.

(I like to imagine that the release of Perl 6 is sparking tiresome reminiscences like this from ageing programmers across the world.)

Perl 6 looks like an interesting language. (It’s a different language from Perl 5, not compatible with it.) The great strength of Perl was of course its text-processing capacity, and for all the fun/cute/radically-unmaintainable syntax showcased on the Perl 6 site, it’s clear that that’s still a big focus. Perl 6 appears to be the first language to attempt to do the right thing with Unicode from the ground up: that is, to represent text as Unicode graphemes, rather than byte strings (like C/C++), awkward UTF-16 sequences (Java), or ISO-10646 codepoint sequences (Python 3). (A user-perceived character such as “é” is a single grapheme even though it may be stored as either one codepoint or two, an “e” followed by a combining accent.) This could, in principle, mean your ad-hoc botched-together text processing script in Perl 6 has a better chance of working reliably across multilingual inputs than it would in any other language. Since plenty of ad-hoc botched-together text processing still goes on in companies around the world, that has the feel of a good thing.

MIREX 2015 submissions

For the past three years now, we’ve taken a number of Vamp audio analysis plugins published by the Centre for Digital Music and submitted them to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2014 and in 2013.

Here are this year’s outcomes. All these categories are ones we had submitted to before, but I managed to miss a couple of category deadlines last year, so in total we had more categories than in either 2013 or 2014.

Structural Segmentation

Results for the four datasets are here, here, here, and here. This is one of the categories I missed last year and, although I find the evaluations quite hard to understand, it’s clear that the competition has moved on a bit.

Our own submissions, the Segmentino plugin from Matthias Mauch and the much older QM Segmenter from Mark Levy, produced the expected results (identical to 2013 for Segmentino; similar for QM Segmenter, which has a random initialisation step). As before, Segmentino obtains the better scores. There was only one other submission this year, a convolutional neural network based approach from Thomas Grill and Jan Schlüter which (I think) outperformed both of ours by some margin, particularly on the segment boundary measure.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. In addition to last year’s submission for the note tracking task of this category, this year I also scripted up a submission for the multiple fundamental frequency estimation task. Emmanouil Benetos and I had made some tweaks to the Silvet plugin during the year, and we also submitted a new fast “live mode” version of it. The evaluation also includes a new test dataset this year.

Our updated Silvet plugin scores better than last year’s version in every test they have in common, and the “live mode” version is actually not all that far off, considering that it’s very much written for speed. (Nice to see a report of run times in the results page — Silvet live mode is 15-20 times as fast as the default Silvet mode and more than twice as fast as any other submission.) Emmanouil’s more recent research method does substantially better, but this is still a pleasing result.

This category is an extremely difficult one, and it’s also painfully difficult to get good test data for it. There’s plenty of potential here, but it’s worth noting that a couple of the authors of the best submissions from last year were not represented this year — in particular, if Elowsson and Friberg’s 2014 method had appeared again this year, it looks as if it would still be at the top.

Audio Onset Detection

Results here. Although the top scores haven’t improved since last year, the field has broadened a bit — it’s no longer only Sebastian Böck vs the world. Our two submissions, both venerable methods, are now placed last and second-last.

Oddly, our OnsetsDS submission gets slightly better results than last year despite being the same, deterministic, implementation (indeed exactly the same plugin binary) run on the same dataset. I should probably check this with the task captain.

Audio Beat Tracking

Results here, here, and here. Again the other submissions are moving well ahead and our BeatRoot and QM Tempo Tracker submissions, producing unchanged results from last year and the year before, are now languishing toward the back. (Next year will see BeatRoot’s 15th birthday, by the way.) The top of the leaderboard is largely occupied by a set of neural network based methods from Sebastian Böck and Florian Krebs.

This is a more interesting category than it gets credit for, I think — still improving and still with potential. Some MIREX categories have very simplistic test datasets, but this category introduced an intentionally difficult test set in 2012 and it’s notable that the best new submissions are doing far better here than the older ones. I’m not quite clear on how the evaluation process handles the problem of what the ground truth represents, and I’d love to know what a reasonable upper bound on F-measure might be.

Audio Tempo Estimation

Results here. This is another category I missed last year, but we get the same results for the QM Tempo Tracker as we did in 2013. It still does tolerably well considering its output isn’t well fitted to the evaluation metric (which rewards estimators that produce best and second-best estimates across the whole piece).

The top scorer here is a neural network approach (spotting a theme here?) from Sebastian Böck, just as for beat tracking.

Audio Key Detection

Results here and here. The second dataset is new.

The QM Key Detector gets the same results as last year for the dataset that existed then. It scores much worse on the new dataset, which suggests that may be a more realistic test. Again there were no other submissions in this category — a pity now that it has a second dataset. Does nobody like key estimation? (I realise it’s a problematic task from a musical point of view, but it does have its applications.)

Audio Chord Estimation

Poor results for Chordino because of a bug which I went over at agonising length in my previous post. This problem is now fixed in Chordino v1.1, so hopefully it’ll be back to its earlier standard in 2016!

Some notes

Neural networks

… are widely used this year. Several categories contained at least one submission whose abstract referred to a convolutional or recurrent neural network or to deep learning, and in at least 5 categories I think a neural network method can reasonably be said to have “won” the category. (Yes I know, MIREX isn’t a competition…)

  • Structural segmentation: convolutional NN performed best
  • Beat tracking: NNs all over the place, definitely performing best
  • Tempo estimation: NN performed best
  • Onset detection: NN performed best
  • Multi-F0: no NNs I think, but it does look as if last year’s “deep learning” submission would have performed better than any of this year’s
  • Chord estimation: NNs present, but not yet quite at the top
  • Key detection: no NNs, indeed no other submissions at all

Categories I missed

  • Audio downbeat estimation: I think I just overlooked this one, for the second year in a row. As last year, I should have submitted the QM Bar & Beat Tracker plugin from Matthew Davies.
  • Real-time audio to score alignment: I nearly submitted the MATCH Vamp Plugin for this, but actually it only produces a reverse path (offline alignment) and not a real-time output, even though it’s a real-time-capable method internally.

Other submissions from the Centre for Digital Music

Emmanouil Benetos submitted a well-performing method, mentioned above, in the Multiple Fundamental Frequency Estimation & Tracking category.

Apart from that, there appear to be none.

This feels like a pity — evaluation is always a pain and it’s nice to get someone else to do some of it.

It’s also a pity because several of the plugins I’m submitting are getting a bit old and are falling to the bottom of the results tables. There are very sound reasons for submitting them (though I may drop some of the less well performing categories next year, assuming I do this again) but it would be good if they didn’t constitute the only visibility QM researchers have in MIREX.

Why would this be the case? I don’t really know. The answer presumably must include some or all of:

  • not working on music informatics signal-processing research at all
  • working on research that builds on feature extractors, rather than building the feature extractors themselves
  • research not congruent with MIREX tasks (e.g. looking at dynamics or articulations rather than say notes or chords)
  • research using similar methods but not on mainstream music recordings (e.g. solo singing, animal sounds)
  • state-of-the-art considered good enough
  • lacking the background to compete with current methods (e.g. the wave of NNs) and so sticking with incremental enhancements of existing models
  • lacking the data to compete with current methods
  • not aware of MIREX
  • not prioritised by supervisor

The last four reasons would be a problem, but the rest might not be. It could really be that MIREX isn’t very relevant to the people in this group at the moment. I’ll have to see what I can find out.

Chordino troubles

On September the 9th, I released a v1.0 build of the Chordino and NNLS Chroma Vamp plugin. This plugin analyses audio recordings of music and calculates some harmonic features, including an estimated chord transcription. When used with Sonic Visualiser, Chordino is potentially very useful for anyone who likes to play along with songs, as well as for research.

Chordino was written by Matthias Mauch, based on his own research. Although I made this 1.0 release, my work on it only really extended as far as fixing some bugs found in earlier releases using the Vamp Plugin Tester.

Unfortunately, with one of those fixes, I broke the plugin. The supposedly more reliable 1.0 update was substantially less accurate at identifying both chord-change boundaries and the chords themselves than any previous version.

I didn’t notice. Nor did Matthias, who had recently left our research group and was busy starting at a new job. One colleague sent me an email saying he had problems with the new release, but I jumped to the completely wrong conclusion that this had something to do with parameter settings having changed since the last release, and suggested he raise it with Matthias.

I only realised what had happened after we submitted the plugin to MIREX evaluation, something we do routinely every year for plugins published by C4DM, when the MIREX task captain Johan Pauwels emailed to ask whether I had expected its scores to drop by 15% from the previous year’s submission (see results pages for 2014, 2015). By that time, the broken plugin had been available for over a month.

This is obviously hugely embarrassing—perhaps the most unambiguous screwup of my whole programming career so far. As the supposed professional software developer of my research group, I took someone else’s working code, broke it, published and promoted the broken version with his name all over it, and then submitted it to a public evaluation, again with his name on it, where its brokenness was finally pointed out to me, a month later, by someone else. Any regression test, even on only a single audio file, would have shown up the problem immediately. Regression-testing this sort of software can be tricky, but the simplest possible test would have worked here. And a particularly nice irony is provided by the fact that I’ve just come from a four-year project whose aims included trying to improve the way software is tested in academia.

I’ve now published a fixed version of the plugin (v1.1), available for download here. This one has been regression tested against known-good output, and the tests are in the repository for future use. The broken version is actually gone from the download page (though of course it is still tagged in the source repository), to avoid anyone getting the wrong one by accident.

I’m also working on a way to make simple regression tests easier to provide and run, for the other plugins I work on.

That’s all for the “public service announcement” bit of this post; read on only if you’re interested in the details.

What was the change that broke it? Well, it was a change I made after running the plugin through the Vamp Plugin Tester, a sort of automated fuzz-testing tool that helps you find problems with your code. (Again, there’s an irony here. Using this tool is undoubtedly a good practice, as it can show up all sorts of problems that might not be apparent to developers otherwise. Even so, I should have remembered how common it is to introduce bugs while fixing things like compiler warnings and static-analysis complaints.)

The problem I was trying to fix here was that intermediate floating-point divisions sometimes overflowed, resulting in infinity values in the output. This only happened for unusual inputs, so it appeared reasonable to fix it by clamping intermediate values when they appeared to be blowing up out of the expected range. But I set the threshold too low, so that many intermediate values from legitimate inputs were also being mangled. I then also made a stupid typo that made the results a bit worse still (you can see the change in question around line 500 of the file in this diff).
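For illustration, the shape of the mistake was roughly as follows. This is a sketch only, in Standard ML rather than the plugin’s C++, and the names and numbers are invented rather than taken from the real code; but the pattern, a guard whose limit is set low enough to catch legitimate values as well as blown-up ones, is the same:

(* clamp a value into the range [~limit, limit] *)
fun clamp limit x = Real.max (~limit, Real.min (limit, x))

(* reasonable: the limit is far above anything a legitimate input produces *)
val safeGuard = clamp 1.0e10

(* broken: the limit is low enough that legitimate intermediate values get mangled too *)
val brokenGuard = clamp 100.0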

Note that this only broke the output from the Chordino chord estimator, not the other features calculated by NNLS Chroma.

A digression. An ongoing topic of debate in the world of the Research Software Engineer is whether software development resources in academia should be localised within research groups, or centralised.

The localised approach, which my research group has taken with my own position, employs developers directly within a research subject. The centralised approach, typified by the Research Software Development group at UCL, proposes a group of software developers who are loaned or hired out to research groups according to need and availability.

In theory, the localised approach can be simpler to manage and should increase the likelihood of developers being available to help with small pieces of work requiring subject knowledge at short notice. The centralised approach has the advantage that all developers can share the non-subject-specific parts of their workload and knowhow.

I believe that in general a localised approach is useful, and I suspect it is easier to hire developers for a specific research group than to find developers good enough to be able to parachute in to anywhere from a central team.

In a case like this, though, the localised approach makes for quite a lonely situation.

Companies that produce large software products that work do so not because they employ amazing developers but because they have systems in place to support them: code review, unit testing, regression tests, continuous integration, user acceptance tests.

But for me as a lone professional developer in a research group, it’s essentially my responsibility to provide those safety nets as well as to use them. I had some of them in place for most of the code I work on, but there was a big hole for this particular project. I broke the code, and I didn’t notice because I didn’t have the right tests ready. Neither did the researcher who wrote most of this code, but that wasn’t his job. When some software goes out from this group that I have worked on, it’s my responsibility to make sure that the code aspects of it (as opposed to the underlying methods) work correctly. Part of my job has to be to assume that nobody else will be in a position to help.