MIREX 2015 submissions

For the past three years now, we’ve taken a number of Vamp audio analysis plugins published by the Centre for Digital Music and submitted them to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2014 and in 2013.

Here are this year’s outcomes. All these categories are ones we had submitted to before, but I managed to miss a couple of category deadlines last year, so in total we had more categories than in either 2013 or 2014.

Structural Segmentation

Results for the four datasets are here, here, here, and here. This is one of the categories I missed last year and, although I find the evaluations quite hard to understand, it’s clear that the competition has moved on a bit.

Our own submissions, the Segmentino plugin from Matthias Mauch and the much older QM Segmenter from Mark Levy, produced the expected results (identical to 2013 for Segmentino; similar for QM Segmenter, which has a random initialisation step). As before, Segmentino obtains the better scores. There was only one other submission this year, a convolutional neural network based approach from Thomas Grill and Jan Schlüter which (I think) outperformed both of ours by some margin, particularly on the segment boundary measure.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. In addition to last year’s submission for the note tracking task of this category, this year I also scripted up a submission for the multiple fundamental frequency estimation task. Emmanouil Benetos and I had made some tweaks to the Silvet plugin during the year, and we also submitted a new fast “live mode” version of it. The evaluation also includes a new test dataset this year.

Our updated Silvet plugin scores better than last year’s version in every test they have in common, and the “live mode” version is actually not all that far off, considering that it’s very much written for speed. (Nice to see a report of run times in the results page — Silvet live mode is 15-20 times as fast as the default Silvet mode and more than twice as fast as any other submission.) Emmanouil’s more recent research method does substantially better, but this is still a pleasing result.

This category is an extremely difficult one, and it’s also painfully difficult to get good test data for it. There’s plenty of potential here, but it’s worth noting that a couple of the authors of the best submissions from last year were not represented this year — in particular, if Elowsson and Friberg’s 2014 method had appeared again this year, it looks as if it would still be at the top.

Audio Onset Detection

Results here. Although the top scores haven’t improved since last year, the field has broadened a bit — it’s no longer only Sebastian Böck vs the world. Our two submissions, both venerable methods, are now placed last and second-last.

Oddly, our OnsetsDS submission gets slightly better results than last year despite being the same, deterministic, implementation (indeed exactly the same plugin binary) run on the same dataset. I should probably check this with the task captain.

Audio Beat Tracking

Results here, here, and here. Again the other submissions are moving well ahead and our BeatRoot and QM Tempo Tracker submissions, producing unchanged results from last year and the year before, are now languishing toward the back. (Next year will see BeatRoot’s 15th birthday, by the way.) The top of the leaderboard is largely occupied by a set of neural network based methods from Sebastian Böck and Florian Krebs.

This is a more interesting category than it gets credit for, I think — still improving and still with potential. Some MIREX categories have very simplistic test datasets, but this category introduced an intentionally difficult test set in 2012 and it’s notable that the best new submissions are doing far better here than the older ones. I’m not quite clear on how the evaluation process handles the problem of what the ground truth represents, and I’d love to know what a reasonable upper bound on F-measure might be.
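For readers unfamiliar with the measure, here is a rough sketch in Python of how a beat-tracking F-measure can be computed. It is not the MIREX evaluation code: the function name, the greedy one-to-one matching, and the 70 ms tolerance window are all assumptions chosen for illustration.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure for beat times in seconds.

    Assumption: each reference beat may be matched by at most one
    estimated beat lying within +/- tolerance seconds of it.
    """
    unmatched = sorted(estimated)
    hits = 0
    for ref in sorted(reference):
        # Estimated beats close enough to this reference beat
        candidates = [e for e in unmatched if abs(e - ref) <= tolerance]
        if candidates:
            # Greedily consume the nearest one so it cannot match twice
            best = min(candidates, key=lambda e: abs(e - ref))
            unmatched.remove(best)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: three of the four estimates fall inside the window
reference_beats = [0.5, 1.0, 1.5, 2.0]
estimated_beats = [0.52, 1.0, 1.75, 2.01]
score = beat_f_measure(reference_beats, estimated_beats)
```

Because the score depends entirely on the annotated reference beats, any ambiguity in the ground truth (for example, annotators tapping at different metrical levels) limits the F-measure that even a perfect tracker could reach.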

Audio Tempo Estimation

Results here. This is another category I missed last year, but we get the same results for the QM Tempo Tracker as we did in 2013. It still does tolerably well considering its output isn’t well fitted to the evaluation metric (which rewards estimators that produce best and second-best estimates across the whole piece).

The top scorer here is a neural network approach (spotting a theme here?) from Sebastian Böck, just as for beat tracking.

Audio Key Detection

Results here and here. The second dataset is new.

The QM Key Detector gets the same results as last year for the dataset that existed then. It scores much worse on the new dataset, which suggests that the new dataset may be a more realistic test. Again there were no other submissions in this category, which is a pity now that it has a second dataset. Does nobody like key estimation? (I realise it’s a problematic task from a musical point of view, but it does have its applications.)

Audio Chord Estimation

Poor results for Chordino because of a bug which I went over at agonising length in my previous post. This problem is now fixed in Chordino v1.1, so hopefully it’ll be back to its earlier standard in 2016!

Some notes

Neural networks

… are widely used this year. Several categories contained at least one submission whose abstract referred to a convolutional or recurrent neural network or deep learning, and in at least 5 categories I think a neural network method can reasonably be said to have “won” the category. (Yes I know, MIREX isn’t a competition…)

  • Structural segmentation: convolutional NN performed best
  • Beat tracking: NNs all over the place, definitely performing best
  • Tempo estimation: NN performed best
  • Onset detection: NN performed best
  • Multi-F0: no NNs I think, but it does look as if last year’s “deep learning” submission would have performed better than any of this year’s
  • Chord estimation: NNs present, but not yet quite at the top
  • Key detection: no NNs, indeed no other submissions at all

Categories I missed

  • Audio downbeat estimation: I think I just overlooked this one, for the second year in a row. As last year, I should have submitted the QM Bar & Beat Tracker plugin from Matthew Davies.
  • Real-time audio to score alignment: I nearly submitted the MATCH Vamp Plugin for this, but actually it only produces a reverse path (offline alignment) and not a real-time output, even though it’s a real-time-capable method internally.

Other submissions from the Centre for Digital Music

Emmanouil Benetos submitted a well-performing method, mentioned above, in the Multiple Fundamental Frequency Estimation & Tracking category.

Apart from that, there appear to be none.

This feels like a pity — evaluation is always a pain and it’s nice to get someone else to do some of it.

It’s also a pity because several of the plugins I’m submitting are getting a bit old and are falling to the bottom of the results tables. There are very sound reasons for submitting them (though I may drop some of the less well performing categories next year, assuming I do this again) but it would be good if they didn’t constitute the only visibility QM researchers have in MIREX.

Why would this be the case? I don’t really know. The answer presumably must include some or all of

  • not working on music informatics signal-processing research at all
  • working on research that builds on feature extractors, rather than building the feature extractors themselves
  • research not congruent with MIREX tasks (e.g. looking at dynamics or articulations rather than say notes or chords)
  • research uses similar methods but not on mainstream music recordings (e.g. solo singing, animal sounds)
  • state-of-the-art considered good enough
  • lacking the background to compete with current methods (e.g. the wave of NNs) and so sticking with progressive enhancements of existing models
  • lacking the data to compete with current methods
  • not aware of MIREX
  • not prioritised by supervisor

The last four reasons would be a problem, but the rest might not be. It could really be that MIREX isn’t very relevant to the people in this group at the moment. I’ll have to see what I can find out.

Chordino troubles

On September the 9th, I released a v1.0 build of the Chordino and NNLS Chroma Vamp plugin. This plugin analyses audio recordings of music and calculates some harmonic features, including an estimated chord transcription. When used with Sonic Visualiser, Chordino is potentially very useful for anyone who likes to play along with songs, as well as for research.

Chordino was written by Matthias Mauch, based on his own research. Although I made this 1.0 release, my work on it only really extended as far as fixing some bugs found in earlier releases using the Vamp Plugin Tester.

Unfortunately, with one of those fixes, I broke the plugin. The supposedly more reliable 1.0 update was substantially less accurate at identifying both chord-change boundaries and the chords themselves than any previous version.

I didn’t notice. Nor did Matthias, who had recently left our research group and was busy starting at a new job. One colleague sent me an email saying he had problems with the new release, but I jumped to the completely wrong conclusion that this had something to do with parameter settings having changed since the last release, and suggested he raise it with Matthias.

I only realised what had happened after we submitted the plugin to MIREX evaluation, something we do routinely every year for plugins published by C4DM, when the MIREX task captain Johan Pauwels emailed to ask whether I had expected its scores to drop by 15% from the previous year’s submission (see results pages for 2014, 2015). By that time, the broken plugin had been available for over a month.

This is obviously hugely embarrassing, perhaps the most unambiguous screwup of my whole programming career so far. As the supposed professional software developer of my research group, I took someone else’s working code, broke it, published and promoted the broken version with his name all over it, and then submitted it to a public evaluation, again with his name on it, where its brokenness was finally pointed out to me, a month later, by someone else. Any regression test, even on only a single audio file, would have shown up the problem immediately. Regression-testing this sort of software can be tricky, but the simplest possible test would have worked here. And a particularly nice irony is provided by the fact that I’ve just come from a four-year project whose aims included trying to improve the way software is tested in academia.

I’ve now published a fixed version of the plugin (v1.1), available for download here. This one has been regression tested against known-good output, and the tests are in the repository for future use. The broken version is actually gone from the download page (though of course it is still tagged in the source repository), to avoid anyone getting the wrong one by accident.

I’m also working on a way to make simple regression tests easier to provide and run, for the other plugins I work on.
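To show how little is needed, here is a minimal sketch in Python of the kind of check described above: comparing a new build’s chord output against stored known-good output. The “time,label” CSV layout, the function names, and the 10 ms tolerance are assumptions made for illustration; this is not the test code that lives in the plugin repository.

```python
def parse_chord_csv(text):
    """Parse lines of 'time,label' into a list of (seconds, chord) pairs."""
    rows = []
    for line in text.strip().splitlines():
        time_str, label = line.split(",", 1)
        rows.append((float(time_str), label.strip().strip('"')))
    return rows


def regression_diff(expected, actual, time_tolerance=0.01):
    """Return a list of human-readable differences; an empty list is a pass."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"expected {len(expected)} chords, got {len(actual)}")
    for i, ((t_exp, c_exp), (t_act, c_act)) in enumerate(zip(expected, actual)):
        # Allow tiny timing drift, but no change of chord label
        if abs(t_exp - t_act) > time_tolerance:
            problems.append(f"row {i}: time {t_act} differs from {t_exp}")
        if c_exp != c_act:
            problems.append(f"row {i}: chord {c_act!r} differs from {c_exp!r}")
    return problems


# Hypothetical outputs: one stored from a known-good build, one from a new build
known_good = parse_chord_csv("0.0,N\n1.2,C\n3.4,Am\n")
new_build = parse_chord_csv("0.0,N\n1.2,C\n3.9,F\n")

for problem in regression_diff(known_good, new_build):
    print(problem)
```

In practice the two inputs would be files: one checked in alongside the code, the other produced by running the freshly built plugin on the same audio. A non-empty list of differences would then fail the build.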

That’s all for the “public service announcement” bit of this post; read on only if you’re interested in the details.

What was the change that broke it? Well, it was a change I made after running the plugin through the Vamp Plugin Tester, a sort of automated fuzz-testing tool that helps you find problems with your code. (Again, there’s an irony here. Using this tool is undoubtedly a good practice, as it can show up all sorts of problems that might not be apparent to developers otherwise. Even so, I should have known well how common it is to introduce bugs while fixing things like compiler warnings and static analysis tests.)

The problem I was trying to fix here was that intermediate floating-point divisions sometimes overflowed, resulting in infinity values in the output. This only happened for unusual inputs, so it appeared reasonable to fix it by clamping intermediate values when they appeared to be blowing up out of the expected range. But I set the threshold too low, so that many intermediate values from legitimate inputs were also being mangled. I then also made a stupid typo that made the results a bit worse still (you can see the change in question around line 500 of the file in this diff).
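The failure mode is easy to reproduce in miniature. The following Python sketch is not the plugin’s code (which is C++), and the values and limits in it are invented for illustration. It shows how a clamp that correctly tames an overflow will also quietly mangle legitimate values if its limit is set too low.

```python
import math


def clamp(values, limit):
    """Clamp each value into [-limit, limit]; infinities become +/- limit.

    Note: NaN inputs are not handled specially in this sketch.
    """
    return [max(-limit, min(limit, v)) for v in values]


# Hypothetical intermediate values
legitimate = [0.5, 12.0, 250.0]    # from ordinary input
pathological = [0.5, math.inf]     # from an overflowing division

generous = clamp(legitimate, limit=1e6)        # legitimate values untouched
too_tight = clamp(legitimate, limit=100.0)     # 250.0 is silently cut to 100.0
fixed_overflow = clamp(pathological, limit=1e6)  # infinity replaced by the limit
```

Both limits remove the infinity, so a test that only checks for infinite outputs passes either way. Only a comparison against known-good results for ordinary input reveals that the tighter limit has changed them.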

Note that this only broke the output from the Chordino chord estimator, not the other features calculated by NNLS Chroma.

A digression. An ongoing topic of debate in the world of the Research Software Engineer is whether software development resources in academia should be localised within research groups, or centralised.

The localised approach, which my research group has taken with my own position, employs developers directly within a research subject. The centralised approach, typified by the Research Software Development group at UCL, proposes a group of software developers who are loaned or hired out to research groups according to need and availability.

In theory, the localised approach can be simpler to manage and should increase the likelihood of developers being available to help with small pieces of work requiring subject knowledge at short notice. The centralised approach has the advantage that all developers can share the non-subject-specific parts of their workload and knowhow.

I believe that in general a localised approach is useful, and I suspect it is easier to hire developers for a specific research group than to find developers good enough to be able to parachute in to anywhere from a central team.

In a case like this, though, the localised approach makes for quite a lonely situation.

Companies that produce large software products that work do so not because they employ amazing developers but because they have systems in place to support them: code review, unit testing, regression tests, continuous integration, user acceptance tests.

But for me as a lone professional developer in a research group, it’s essentially my responsibility to provide those safety nets as well as to use them. I had some of them in place for most of the code I work on, but there was a big hole for this particular project. I broke the code, and I didn’t notice because I didn’t have the right tests ready. Neither did the researcher who wrote most of this code, but that wasn’t his job. When some software goes out from this group that I have worked on, it’s my responsibility to make sure that the code aspects of it (as opposed to the underlying methods) work correctly. Part of my job has to be to assume that nobody else will be in a position to help.


… and an FFT in Standard ML

While writing my earlier post on Javascript FFTs, I also (for fun) adapted the Nayuki FFT code into Standard ML. You can find it here.

The original idea was to see how performance of SML compiled to native code, and SML compiled to Javascript using smltojs, compared with the previously-tested Javascript implementations and with any other SML versions I could find. (There’s FFT and DFT code from Rory McGuire here, probably written for clarity rather than speed, plus versions I haven’t tried in the SML/NJ and MLKit test libraries.)

I didn’t get as far as doing a real comparison, but I did note that it ran at more or less the same speed when compiled natively with MLton as the Javascript version does when run in Firefox, and that compiling to JS with smltojs produced much slower code. I haven’t checked where the overhead lies.