MIREX 2015 submissions

For the past three years now, we’ve taken a number of Vamp audio analysis plugins published by the Centre for Digital Music and submitted them to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2014 and in 2013.

Here are this year’s outcomes. All these categories are ones we had submitted to before, but I managed to miss a couple of category deadlines last year, so this year we covered more categories in total than in either 2013 or 2014.

Structural Segmentation

Results for the four datasets are here, here, here, and here. This is one of the categories I missed last year and, although I find the evaluations quite hard to understand, it’s clear that the competition has moved on a bit.

Our own submissions, the Segmentino plugin from Matthias Mauch and the much older QM Segmenter from Mark Levy, produced the expected results (identical to 2013 for Segmentino; similar for QM Segmenter, which has a random initialisation step). As before, Segmentino obtains the better scores. There was only one other submission this year, a convolutional neural network based approach from Thomas Grill and Jan Schlüter which (I think) outperformed both of ours by some margin, particularly on the segment boundary measure.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. In addition to last year’s submission for the note tracking task of this category, this year I also scripted up a submission for the multiple fundamental frequency estimation task. Emmanouil Benetos and I had made some tweaks to the Silvet plugin during the year, and we also submitted a new fast “live mode” version of it. The evaluation also includes a new test dataset this year.

Our updated Silvet plugin scores better than last year’s version in every test they have in common, and the “live mode” version is actually not all that far off, considering that it’s very much written for speed. (Nice to see run times reported on the results page: Silvet live mode is 15-20 times as fast as the default Silvet mode and more than twice as fast as any other submission.) Emmanouil’s more recent research method does substantially better, but this is still a pleasing result.

This category is an extremely difficult one, and it’s also painfully difficult to get good test data for it. There’s plenty of potential here, but it’s worth noting that a couple of the authors of the best submissions from last year were not represented this year — in particular, if Elowsson and Friberg’s 2014 method had appeared again this year, it looks as if it would still be at the top.

Audio Onset Detection

Results here. Although the top scores haven’t improved since last year, the field has broadened a bit — it’s no longer only Sebastian Böck vs the world. Our two submissions, both venerable methods, are now placed last and second-last.

Oddly, our OnsetsDS submission gets slightly better results than last year despite being the same, deterministic, implementation (indeed exactly the same plugin binary) run on the same dataset. I should probably check this with the task captain.

Audio Beat Tracking

Results here, here, and here. Again the other submissions are moving well ahead and our BeatRoot and QM Tempo Tracker submissions, producing unchanged results from last year and the year before, are now languishing toward the back. (Next year will see BeatRoot’s 15th birthday, by the way.) The top of the leaderboard is largely occupied by a set of neural network based methods from Sebastian Böck and Florian Krebs.

This is a more interesting category than it gets credit for, I think — still improving and still with potential. Some MIREX categories have very simplistic test datasets, but this category introduced an intentionally difficult test set in 2012 and it’s notable that the best new submissions are doing far better here than the older ones. I’m not quite clear on how the evaluation process handles the problem of what the ground truth represents, and I’d love to know what a reasonable upper bound on F-measure might be.

Audio Tempo Estimation

Results here. This is another category I missed last year, but we get the same results for the QM Tempo Tracker as we did in 2013. It still does tolerably well considering its output isn’t well fitted to the evaluation metric (which rewards estimators that produce best and second-best estimates across the whole piece).

The top scorer here is a neural network approach (spotting a theme here?) from Sebastian Böck, just as for beat tracking.

Audio Key Detection

Results here and here. The second dataset is new.

The QM Key Detector gets the same results as last year for the dataset that existed then. It scores much worse on the new dataset, which suggests that it may be a more realistic test. Again there were no other submissions in this category, which is a pity now that it has a second dataset. Does nobody like key estimation? (I realise it’s a problematic task from a musical point of view, but it does have its applications.)

Audio Chord Estimation

Poor results for Chordino because of a bug which I went over at agonising length in my previous post. This problem is now fixed in Chordino v1.1, so hopefully it’ll be back to its earlier standard in 2016!

Some notes

Neural networks

… are widely used this year. Several categories contained at least one submission whose abstract referred to a convolutional or recurrent neural network or deep learning, and in at least 5 categories I think a neural network method can reasonably be said to have “won” the category. (Yes, I know, MIREX isn’t a competition…)

  • Structural segmentation: convolutional NN performed best
  • Beat tracking: NNs all over the place, definitely performing best
  • Tempo estimation: NN performed best
  • Onset detection: NN performed best
  • Multi-F0: no NNs I think, but it does look as if last year’s “deep learning” submission would have performed better than any of this year’s
  • Chord estimation: NNs present, but not yet quite at the top
  • Key detection: no NNs, indeed no other submissions at all

Categories I missed

  • Audio downbeat estimation: I think I just overlooked this one, for the second year in a row. As last year, I should have submitted the QM Bar & Beat Tracker plugin from Matthew Davies.
  • Real-time audio to score alignment: I nearly submitted the MATCH Vamp Plugin for this, but actually it only produces a reverse path (offline alignment) and not a real-time output, even though it’s a real-time-capable method internally.

Other submissions from the Centre for Digital Music

Emmanouil Benetos submitted a well-performing method, mentioned above, in the Multiple Fundamental Frequency Estimation & Tracking category.

Apart from that, there appear to be none.

This feels like a pity — evaluation is always a pain and it’s nice to get someone else to do some of it.

It’s also a pity because several of the plugins I’m submitting are getting a bit old and are falling to the bottom of the results tables. There are very sound reasons for submitting them (though I may drop some of the less well performing categories next year, assuming I do this again) but it would be good if they didn’t constitute the only visibility QM researchers have in MIREX.

Why would this be the case? I don’t really know. The answer must presumably include some or all of the following:

  • not working on music informatics signal-processing research at all
  • working on research that builds on feature extractors, rather than building the feature extractors themselves
  • research not congruent with MIREX tasks (e.g. looking at dynamics or articulations rather than say notes or chords)
  • research uses similar methods but not on mainstream music recordings (e.g. solo singing, animal sounds)
  • state-of-the-art considered good enough
  • lack the background to compete with current methods (e.g. the wave of NNs) and so sticking with progressive enhancements of existing models
  • lack the data to compete with current methods
  • not aware of MIREX
  • not prioritised by supervisor

The last four reasons would be a problem, but the rest might not be. It could really be that MIREX isn’t very relevant to the people in this group at the moment. I’ll have to see what I can find out.

Chordino troubles

On September the 9th, I released a v1.0 build of the Chordino and NNLS Chroma Vamp plugin. This plugin analyses audio recordings of music and calculates some harmonic features, including an estimated chord transcription. When used with Sonic Visualiser, Chordino is potentially very useful for anyone who likes to play along with songs, as well as for research.

Chordino was written by Matthias Mauch, based on his own research. Although I made this 1.0 release, my work on it only really extended as far as fixing some bugs found in earlier releases using the Vamp Plugin Tester.

Unfortunately, with one of those fixes, I broke the plugin. The supposedly more reliable 1.0 update was substantially less accurate at identifying both chord-change boundaries and the chords themselves than any previous version.

I didn’t notice. Nor did Matthias, who had recently left our research group and was busy starting at a new job. One colleague sent me an email saying he had problems with the new release, but I jumped to the completely wrong conclusion that this had something to do with parameter settings having changed since the last release, and suggested he raise it with Matthias.

I only realised what had happened after we submitted the plugin to MIREX evaluation, something we do routinely every year for plugins published by C4DM, when the MIREX task captain Johan Pauwels emailed to ask whether I had expected its scores to drop by 15% from the previous year’s submission (see results pages for 2014, 2015). By that time, the broken plugin had been available for over a month.

This is obviously hugely embarrassing: perhaps the most unambiguous screwup of my whole programming career so far. As the supposed professional software developer of my research group, I took someone else’s working code, broke it, published and promoted the broken version with his name all over it, and then submitted it to a public evaluation, again with his name on it, where its brokenness was finally pointed out to me, a month later, by someone else. Any regression test, even on only a single audio file, would have shown up the problem immediately. Regression-testing this sort of software can be tricky, but the simplest possible test would have worked here. And a particularly nice irony is provided by the fact that I’ve just come from a four-year project whose aims included trying to improve the way software is tested in academia.

I’ve now published a fixed version of the plugin (v1.1), available for download here. This one has been regression tested against known-good output, and the tests are in the repository for future use. The broken version is actually gone from the download page (though of course it is still tagged in the source repository), to avoid anyone getting the wrong one by accident.

I’m also working on a way to make simple regression tests easier to provide and run, for the other plugins I work on.
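
To give an idea of how little would have been needed, here is a minimal sketch of the kind of check involved: run the plugin over a known input once, keep the output as a reference, and compare every subsequent build’s output against it with a small numerical tolerance. The file names and CSV layout below are hypothetical, for illustration only; this is not the actual test now in the repository.

    // Minimal regression check: compare fresh output against a stored
    // known-good reference. Numeric fields must agree within a tolerance;
    // text fields (e.g. chord labels) must match exactly.
    const fs = require('fs');

    function loadCsv(path) {
      return fs.readFileSync(path, 'utf8').trim().split('\n')
          .map(line => line.split(','));
    }

    function outputsMatch(referencePath, candidatePath, tolerance = 1e-6) {
      const ref = loadCsv(referencePath);
      const out = loadCsv(candidatePath);
      if (ref.length !== out.length) return false;
      return ref.every((row, i) =>
        row.length === out[i].length &&
        row.every((cell, j) => {
          const a = parseFloat(cell), b = parseFloat(out[i][j]);
          return isNaN(a) ? cell === out[i][j] : Math.abs(a - b) <= tolerance;
        }));
    }

    if (!outputsMatch('expected/chords.csv', 'obtained/chords.csv')) {
      console.error('Regression: output differs from known-good reference');
      process.exit(1);
    }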

That’s all for the “public service announcement” bit of this post; read on only if you’re interested in the details.

What was the change that broke it? Well, it was a change I made after running the plugin through the Vamp Plugin Tester, a sort of automated fuzz-testing tool that helps you find problems with your code. (Again, there’s an irony here. Using this tool is undoubtedly a good practice, as it can show up all sorts of problems that might not be apparent to developers otherwise. Even so, I should have known full well how common it is to introduce bugs while fixing things like compiler warnings and static-analysis findings.)

The problem I was trying to fix here was that intermediate floating-point divisions sometimes overflowed, resulting in infinity values in the output. This only happened for unusual inputs, so it appeared reasonable to fix it by clamping intermediate values when they appeared to be blowing up out of the expected range. But I set the threshold too low, so that many intermediate values from legitimate inputs were also being mangled. I then also made a stupid typo that made the results a bit worse still (you can see the change in question around line 500 of the file in this diff).
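
The shape of the mistake is easy to show in a deliberately simplified form (this is an illustration in Javascript with made-up numbers, not the actual C++ from the plugin): clamping is harmless if the threshold sits well above anything a legitimate input can produce, and silently destructive if it does not.

    // Guard a division against blowing up to Infinity by clamping
    // its result to a maximum magnitude.
    function guardedDivide(numerator, denominator, limit) {
      const v = numerator / denominator;
      if (!isFinite(v) || Math.abs(v) > limit) {
        // Treat anything outside the expected range as a blow-up
        return (v < 0 ? -1 : 1) * limit;
      }
      return v;
    }

    // With a generous limit, only genuine blow-ups are touched:
    console.log(guardedDivide(3, 2, 1e6)); // 1.5, passed through unchanged
    console.log(guardedDivide(1, 0, 1e6)); // 1000000, clamped as intended

    // With the limit set too low, legitimate values are mangled as well:
    console.log(guardedDivide(3, 2, 1));   // 1, where 1.5 was the right answer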

Note that this only broke the output from the Chordino chord estimator, not the other features calculated by NNLS Chroma.

A digression. An ongoing topic of debate in the world of the Research Software Engineer is whether software development resources in academia should be localised within research groups, or centralised.

The localised approach, which my research group has taken with my own position, employs developers directly within a research subject. The centralised approach, typified by the Research Software Development group at UCL, proposes a group of software developers who are loaned or hired out to research groups according to need and availability.

In theory, the localised approach can be simpler to manage and should increase the likelihood of developers being available to help with small pieces of work requiring subject knowledge at short notice. The centralised approach has the advantage that all developers can share the non-subject-specific parts of their workload and knowhow.

I believe that in general a localised approach is useful, and I suspect it is easier to hire developers for a specific research group than to find developers good enough to be able to parachute in to anywhere from a central team.

In a case like this, though, the localised approach makes for quite a lonely situation.

Companies that produce large software products that work do so not because they employ amazing developers but because they have systems in place to support them: code review, unit testing, regression tests, continuous integration, user acceptance tests.

But for me as a lone professional developer in a research group, it’s essentially my responsibility to provide those safety nets as well as to use them. I had some of them in place for most of the code I work on, but there was a big hole for this particular project. I broke the code, and I didn’t notice because I didn’t have the right tests ready. Neither did the researcher who wrote most of this code, but that wasn’t his job. When some software goes out from this group that I have worked on, it’s my responsibility to make sure that the code aspects of it (as opposed to the underlying methods) work correctly. Part of my job has to be to assume that nobody else will be in a position to help.


… and an FFT in Standard ML

While writing my earlier post on Javascript FFTs, I also (for fun) adapted the Nayuki FFT code into Standard ML. You can find it here.

The original idea was to see how the performance of SML compiled to native code, and of SML compiled to Javascript using smltojs, compared with the previously-tested Javascript implementations and with any other SML versions I could find. (There’s FFT and DFT code from Rory McGuire here, probably written for clarity rather than speed, plus versions I haven’t tried in the SML/NJ and MLKit test libraries.)

I didn’t get as far as doing a real comparison, but I did note that it ran at more or less the same speed when compiled natively with MLton as the Javascript version does when run in Firefox, and that compiling to JS with smltojs produced much slower code. I haven’t checked where the overhead lies.

FFTs in Javascript

Javascript engines are quite fast these days. Can we get away with doing serious signal-processing in Javascript yet?

People are building things like image-processing tools and audio spectrum visualisers in Javascript, so the answer must sometimes be yes, but I wanted to get an idea of how well you’d get on with more demanding tasks like spectral audio processing. Although I’ve done this in many languages, Javascript has not yet been among them. And given that one reason to do this would be portability, the question is connected with the performance of target platforms such as mobile browsers.

Let’s consider the demands of a phase vocoder, commonly used for audio time-stretching (in the Rubber Band Library for example). To make a phase vocoder you chop up a sampled audio signal into short overlapping frames, take the short-time Fourier transform of each, convert its complex output to polar form, modify the phases, transform back again, and glue back together with a different amount of overlap from the original, thus changing the overall duration. Most of the overhead typically is in the forward and inverse Fourier transforms and in Cartesian-to-polar conversion (arctangents are slow).
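
The Cartesian-to-polar step, for instance, amounts to nothing more than the loop below per frame (a sketch with argument names of my own choosing, not code from any of the libraries discussed here); it looks trivial, but Math.atan2 over a thousand or so bins per frame, tens of frames per second, is where a surprising amount of the time goes.

    // Convert one frame's complex spectrum to magnitude/phase form.
    // For a 2048-point real-input FFT there are 1025 usable bins.
    function toPolar(re, im, mag, phase) {
      for (let i = 0; i < re.length; i++) {
        mag[i] = Math.hypot(re[i], im[i]);    // sqrt(re^2 + im^2)
        phase[i] = Math.atan2(im[i], re[i]);  // the slow part
      }
    }

    // A two-bin example:
    const re = new Float64Array([1, 0]), im = new Float64Array([0, 1]);
    const mag = new Float64Array(2), phase = new Float64Array(2);
    toPolar(re, im, mag, phase);
    console.log(mag, phase); // magnitudes [1, 1], phases [0, ~1.5708]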

If you have a mono signal sampled at 44.1kHz and you’re using a simple design with a 2048-point Fast Fourier Transform (FFT) with 75% overlap, you’ll need to go through this process at least 86 times a second to run at the equivalent of real-time. Budgeting maybe a third of the overall time for FFTs, and allowing equal time for forward and inverse transforms, I suppose you want to be able to do at least 500 transforms per second to have a decent chance of running nicely.
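
Spelled out, under the same assumptions, the budget arithmetic looks like this (a rough estimate, not a measurement):

    // 2048-point frames with 75% overlap at 44.1kHz
    const sampleRate = 44100;
    const fftSize = 2048;
    const hop = fftSize * (1 - 0.75);        // 512 samples between frame starts
    const framesPerSec = sampleRate / hop;   // ~86 frames/sec to keep up with real time

    // One forward and one inverse transform per frame, with the FFTs
    // allowed roughly a third of the total processing time:
    const transformsPerSec = framesPerSec * 2 * 3;
    console.log(Math.round(framesPerSec), Math.round(transformsPerSec)); // 86 517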

I decided to try some existing FFT implementations in Javascript and see how they went.

The candidates

These are the implementations I tried. (In parentheses are the labels I’ve used to identify them in the following tables.)

  • Multi-lingual FFT by Nayuki (nayuki)
  • The Nayuki code, but rearranged into an object that does sin/cos table generation once on construction, as sketched just after this list: code is here (nayuki-obj)
  • FFT.js by Jens Nockert (nockert)
  • JSFFT by Nick Jones (dntj)
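
The table-caching idea behind nayuki-obj is roughly the following. This is a minimal sketch of the approach rather than the code linked above, and the class and field names are mine:

    // Build the sin/cos tables once, on construction, instead of
    // regenerating them on every call to the transform.
    class FFTTables {
      constructor(size) {
        this.size = size;
        this.cosTable = new Float64Array(size / 2);
        this.sinTable = new Float64Array(size / 2);
        for (let i = 0; i < size / 2; i++) {
          this.cosTable[i] = Math.cos(2 * Math.PI * i / size);
          this.sinTable[i] = Math.sin(2 * Math.PI * i / size);
        }
      }
      // A transform(real, imag) method can then index cosTable and
      // sinTable in its butterfly loops rather than calling Math.cos
      // and Math.sin afresh for every frame.
    }

    const tables = new FFTTables(2048); // constructed once, reused per frame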

I also took three existing C libraries and compiled them to Javascript with Emscripten:

  • KissFFT, a small but reasonably sophisticated implementation by Mark Borgerding (kissfft)
  • FFTW3, a big library by Matteo Frigo et al (fftw). This library is so focused on native platform-specific performance tricks that it seems crazy to cross-compile it to Javascript, and I would never have thought to do so if it wasn’t for the fact that KissFFT worked better than I’d expected
  • Don Cross’s simple public domain implementation (cross)

Plus a couple that came up in searches but that I did not end up testing:

  • Timbre.js by mohayonao — this is designed to pipeline with other classes in the same framework rather than running standalone
  • dsp.js by Corban Brook — only provides an API to retrieve the magnitude spectrum rather than complex frequency-domain output

I timed only the real-to-complex forward transform. This favours those implementations that optimise for real inputs, which of the above I think means only KissFFT and FFTW3. You can find all the test code in this code project, and I put up a page here which runs the tests in your browser.
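
For what it’s worth, the “transforms per second” measurement needs nothing more elaborate than a loop like this. It is only a sketch of the shape of the benchmark: runTransform stands in for whichever library wrapper is under test, and the project linked above is the definitive version.

    // Rough throughput measurement: how many forward transforms can we
    // run in a fixed wall-clock interval?
    function transformsPerSecond(runTransform, size, seconds = 2) {
      const input = new Float64Array(size);
      for (let i = 0; i < size; i++) {
        input[i] = Math.sin(i * 0.3) + 0.1 * Math.random(); // arbitrary test signal
      }
      const end = performance.now() + seconds * 1000;
      let count = 0;
      while (performance.now() < end) {
        runTransform(input); // real-to-complex forward transform under test
        count++;
      }
      return Math.round(count / seconds);
    }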

Some numbers

These figures are for the number of 2048-point forward transforms per second: higher numbers are better.

Most of these results are from desktop browsers, but I’ve included a couple of mobile devices (thanks to James and Andy for the iPhone numbers). I’ve also included timings of a sample native-code build, made using FFTW3 on Linux, Accelerate on OSX and iOS, and a 32-bit IPP build on Windows.

For each platform, the fastest “real Javascript” version is marked with * and the fastest version overall, apart from the native code, with †.

(nayuki, nayuki-obj, nockert and dntj are the plain Javascript implementations; kissfft, fftw and cross are the C libraries compiled with Emscripten; the final column is the sample native-code build.)

Platform | nayuki | nayuki-obj | nockert | dntj | kissfft | fftw | cross | native
Lenovo Yoga 2 Pro, Firefox 41, Linux | 5200 | 8000 * | 2900 | 850 | 23000 † | 18000 | 10000 | 94000
Lenovo Yoga 2 Pro, Chromium 46, Linux | 5200 | 6500 * | 3600 | 2500 | 15000 † | 7000 | 5800 | 94000
Dell XPS 13 2015, Firefox 41, Windows 10 | 4600 | 7300 * | 2300 | 660 | 18000 † | 11000 | 7400 | 43000
Dell XPS 13 2015, MS Edge, Windows 10 | 5000 | 6900 * | 1800 | 990 | 13000 † | 1900 | 4900 | 43000
MacBook Air 11″ 2012, Firefox 40, OSX 10.10 | 4000 | 5200 * | 2200 | 720 | 8000 | 12000 † | 7400 | 54000
MacBook Air 11″ 2012, Safari 8, OSX 10.10 | 3000 | 3500 * | 1300 | 890 | 10000 † | 2800 | 6000 | 54000
Nokia Lumia 620, IE, WP8.1 | 380 | 390 * | 190 | 94 | 710 † | - | 310 | -
Alcatel OneTouch Fire E, Firefox OS 2.0 | 270 | 310 * | 150 | 87 | 790 † | 270 | 320 | -
Apple iPad mini 2012, Safari, iOS 9 | 300 | 340 * | 150 | 120 | 710 † | 330 | 390 | 2000
Apple iPhone 6, iOS UIWebView, iOS 9 | 72 | 72 | 90 * | 71 | 180 † | 95 | 61 | -
Apple iPhone 6, Safari, iOS 9 | 3800 | 4300 * | 960 | 710 | 8700 † | 2000 | 4800 | -
Apple iPhone 6S, Safari, iOS 9 | 6000 | 7000 * | 1600 | 1000 | 9900 † | 2500 | 7400 | -

The FFTW run didn’t seem to want to produce a result on the WP8 phone.

Some observations

On every platform there is at least one implementation that makes our 500 transforms/sec budget, although some have little to spare. (The exception is the iOS UIWebView, which doesn’t include a modern Javascript engine, but since iOS 8 it seems hybrid apps are allowed to use the fast WebKit web view instead. I haven’t included any Android results, but a similar divide exists between Google Chrome and the built-in browser on earlier Android versions.)

Usually the cross-compiled KissFFT is the fastest option. Although it seems culturally strange for the fastest Javascript version to be one that is not written in Javascript at all, I think this is a positive result: KissFFT simply has faster algorithms than any of the plain Javascript versions, and the compilation process is robust enough to preserve this advantage. The case of FFTW is stranger, but may have something to do with its relatively large girth.

If code size is a primary concern, the Nayuki code appears to be an effective implementation of the core algorithm.

The code produced by Emscripten, in the versions compiled from C, is asm.js. Firefox gives this restricted subset of Javascript special treatment, and it appears to be possible to see how much difference that makes: if you enable the developer console, a message appears warning you that asm.js optimisations are disabled while the console is active. The results are then 10% or so slower.

Overall we’re seeing a slowdown of between 2.4 and 5.4 times compared with a fast native-code version. This may not be a great use of CPU and battery if signal processing is your application’s main job, but it’s good enough to make some interesting things possible.

Any thoughts?

New software releases all around

A few months ago (in February!!) I wrote a post called Unreleased project pile-up that gave a pretty long list of software projects I’d been working on that could benefit from a proper release. It ended: “let’s see how many of these I can tidy up & release during the next few weeks”. The answer: very few.

During the past couple of weeks I’ve finally managed to make a bit of a dent, crossing off these applications from the list:

along with these earlier in the year:

and one update that wasn’t on the list:

Apart from the Python Vamp host, those all fall into the category of “overdue updates”. I haven’t managed to release much of the actually new software on the list.

One “overdue update” that keeps getting pushed back is the next release of Sonic Visualiser. This is going to be quite a major release, featuring audio recording (a feature I once swore I would never add), proper support for retina and hi-dpi displays with retina rendering in the view layers and a new set of scalable icons, support for very long audio files (beyond the 32-bit WAV limit), a unit conversion window to help convert between units such as Hz and MIDI pitch, and a few other significant features. There’s a little way to go before release yet though.

Rosegarden v15.08

D. Michael McIntyre today announced the release of version 15.08 of Rosegarden, an audio and MIDI sequencer and music notation editor.

Rosegarden is a slightly crazy piece of work.

As a project it has existed for more than two decades, and the repository containing its current code was initialised in April 2000. It’s not a huge program, but it is quite complicated, and during its most active period it was run by three argumentative developers all trying to accomplish slightly different things. I wanted to replace Sibelius, and typeset string quartets. Richard wanted to replace Logic and run his home studio with it. Guillaume wanted to replace Band-in-a-Box and make jazz guitar arrangements. We ended up with something that is essentially a MIDI sequencer, but with some audio recording and arrangement capacity and a lot of interesting (fragile) logic for adjusting score layout of music that is stored as MIDI-plus-notation-metadata rather than directly as notation.

Rosegarden has all sorts of rather wild features which even its developers routinely forget. It has a “triggered segment” feature that replaces single notes with algorithmically expanded sequences at playback time, intended for use in playing ornaments from notation but also potentially usable for simple algorithmic compositions. It knows the playable range and transpositions of large numbers of real instruments, and can transpose between them and warn when a part’s notes are out of range. It has a note-timing quantizer that aims to produce notation as well as possible from performed MIDI recordings, but that doesn’t change the underlying recorded MIDI, instead trying (futilely?) to keep tabs on the adaptations necessary to make the raw MIDI appear well on the score. It can record audio, play back MIDI through audio synth plugins, apply effects, and do basic audio editing and timestretching. It has a feature, which surely nobody except me has ever used, that allows you to tap along on a MIDI keyboard to the beats in an audio recording and then inserts tempo events into your MIDI (assuming it represents the same score as the audio you were tapping along to) so that it plays back with the same tempo changes as the audio.

Rosegarden contains about 300,000 lines of C++, excluding all its library dependencies, with (ahem) no tests of any sort. It has seen well over 10,000 commits from about 40 contributors, in a single Subversion repository hosted at SourceForge. (Previously it was in CVS, but the move from CVS to Subversion was hard enough that it has never moved again. Some of its current developers use git, but they do so through a bridge to the Subversion repository.) Although the code is moderately portable, and lightly supported ports to Windows and OS X have appeared, the only platform ever officially supported is Linux, and the code has only been officially published in source code form: it is assumed that Linux distributions will handle compilation and packaging.

Despite its complexities and disadvantages, Rosegarden has survived reasonably well; it appears still to be one of the more widely used programs of its type. Admittedly this is in a tiny pond (Linux-only audio and music users), but it has persisted in spite of all of its early active developers having left the project. Here are the top three committers per year since 2000, by number of commits:

2000  Guillaume Laurent, Chris Cannam
2001  Guillaume Laurent, Chris Cannam, Richard Bown
2002  Richard Bown, Guillaume Laurent, Chris Cannam
2003  Chris Cannam, Guillaume Laurent, Richard Bown
2004  Chris Cannam, Guillaume Laurent, Richard Bown
2005  Guillaume Laurent, Chris Cannam, D. Michael McIntyre
2006  Chris Cannam, Pedro Lopez-Cabanillas, Guillaume Laurent
2007  Chris Cannam, Heikki Junes, Guillaume Laurent
2008  D. Michael McIntyre, Chris Cannam, Heikki Junes
2009  D. Michael McIntyre, Chris Cannam, Jani Frilander
2010  D. Michael McIntyre, Julie Swango, Chris Cannam
2011  D. Michael McIntyre, Ted Felix, Yves Guillemot
2012  Ted Felix, D. Michael McIntyre, Tom Breton
2013  D. Michael McIntyre, Ted Felix, Tom Breton
2014  Ted Felix, D. Michael McIntyre, Tom Breton
2015  Ted Felix, D. Michael McIntyre, Tom Breton

Some developers (Tom Breton for example) flatten numerous commits from git into single Subversion commits for the official repo and are probably under-represented, but this gives the general shape. Richard Bown mostly retired from the project in 2005, although his tally of 794 commits in 2002 still seems to be the record. (Ted Felix has made 231 so far this year.) Guillaume Laurent forcefully moved to OS X in 2007, and I faded out in 2010 after a big push to port Rosegarden from Qt3 to Qt4. What is most notable is the unifying thread provided by D. Michael McIntyre.