MIREX 2015 submissions

For the past three years now, we’ve taken a number of Vamp audio analysis plugins published by the Centre for Digital Music and submitted them to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2014 and in 2013.
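
For anyone curious about the mechanics: a MIREX submission is essentially a command-line wrapper that maps the task’s input and output conventions onto the plugin. A minimal sketch of that kind of wrapper, assuming the Sonic Annotator host is installed and a hypothetical transform.n3 file describes which plugin output to run (this is a sketch, not the actual submission script):

```python
#!/usr/bin/env python
# Minimal sketch of a MIREX-style wrapper around a Vamp plugin:
# take an input audio file and an output path, run the plugin via
# the Sonic Annotator host, and save the results where the
# evaluator expects them. "transform.n3" is a hypothetical
# transform description naming the plugin and output.
import subprocess
import sys

def run_plugin(audio_path, output_path):
    # Ask Sonic Annotator to apply the transform and print CSV
    # results to stdout, which we capture and save.
    result = subprocess.run(
        ["sonic-annotator", "-t", "transform.n3",
         "-w", "csv", "--csv-stdout", audio_path],
        capture_output=True, text=True, check=True)
    with open(output_path, "w") as f:
        f.write(result.stdout)

if __name__ == "__main__":
    run_plugin(sys.argv[1], sys.argv[2])
```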

Here are this year’s outcomes. All these categories are ones we had submitted to before, but I managed to miss a couple of category deadlines last year, so this year we covered more categories than in either 2013 or 2014.

Structural Segmentation

Results for the four datasets are here, here, here, and here. This is one of the categories I missed last year and, although I find the evaluations quite hard to understand, it’s clear that the competition has moved on a bit.

Our own submissions, the Segmentino plugin from Matthias Mauch and the much older QM Segmenter from Mark Levy, produced the expected results (identical to 2013 for Segmentino; similar for QM Segmenter, which has a random initialisation step). As before, Segmentino obtains the better scores. There was only one other submission this year, a convolutional neural network based approach from Thomas Grill and Jan Schlüter which (I think) outperformed both of ours by some margin, particularly on the segment boundary measure.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. In addition to last year’s submission for the note tracking task of this category, this year I also scripted up a submission for the multiple fundamental frequency estimation task. Emmanouil Benetos and I had made some tweaks to the Silvet plugin during the year, and we also submitted a new fast “live mode” version of it. The evaluation also includes a new test dataset this year.
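
Scripting up that submission mostly means converting the plugin’s note output into the frame-level format the estimation task expects. A minimal sketch, assuming a 10 ms hop and a simplified (onset, duration, f0) note list rather than the plugin’s actual CSV columns:

```python
# Sketch: expand a note list into frame-level multiple-F0 output:
# one line per 10 ms frame, the frame timestamp first, then the
# F0s (in Hz) sounding in that frame. The (onset, duration, f0)
# input here is a hypothetical simplification of the plugin's
# actual note output.
HOP = 0.01  # 10 ms frame hop

def notes_to_frames(notes, total_duration):
    """notes: list of (onset_sec, duration_sec, f0_hz) tuples."""
    n_frames = int(total_duration / HOP) + 1
    frames = [[] for _ in range(n_frames)]
    for onset, duration, f0 in notes:
        first = int(round(onset / HOP))
        last = int(round((onset + duration) / HOP))
        for i in range(first, min(last, n_frames)):
            frames[i].append(f0)
    return frames

def write_frames(frames, path):
    with open(path, "w") as f:
        for i, f0s in enumerate(frames):
            fields = ["%.2f" % (i * HOP)] + ["%.3f" % v for v in f0s]
            f.write("\t".join(fields) + "\n")
```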

Our updated Silvet plugin scores better than last year’s version in every test they have in common, and the “live mode” version is not all that far off, considering that it’s very much written for speed. (Nice to see run times reported on the results page: Silvet live mode is 15-20 times as fast as the default Silvet mode and more than twice as fast as any other submission.) Emmanouil’s more recent research method does substantially better, but this is still a pleasing result.

This category is an extremely difficult one, and it’s also painfully difficult to get good test data for it. There’s plenty of potential here, but it’s worth noting that a couple of the authors of the best submissions from last year were not represented this year — in particular, if Elowsson and Friberg’s 2014 method had appeared again this year, it looks as if it would still be at the top.

Audio Onset Detection

Results here. Although the top scores haven’t improved since last year, the field has broadened a bit — it’s no longer only Sebastian Böck vs the world. Our two submissions, both venerable methods, are now placed last and second-last.

Oddly, our OnsetsDS submission gets slightly better results than last year, despite being the same deterministic implementation (indeed exactly the same plugin binary) run on the same dataset. I should probably check this with the task captain.
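
Before asking, a quick sanity check on our side would be to run the same binary twice over a test file and compare digests of the output, confirming the plugin really is deterministic. A trivial sketch with hypothetical file names:

```python
# Determinism sanity check: run the same transform twice on the
# same input and compare hashes of the output. The transform file
# and audio path here are hypothetical.
import hashlib
import subprocess

def output_digest(audio_path):
    result = subprocess.run(
        ["sonic-annotator", "-t", "transform.n3",
         "-w", "csv", "--csv-stdout", audio_path],
        capture_output=True, text=True, check=True)
    return hashlib.sha256(result.stdout.encode()).hexdigest()

assert output_digest("test.wav") == output_digest("test.wav")
```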

Audio Beat Tracking

Results here, here, and here. Again the other submissions are moving well ahead and our BeatRoot and QM Tempo Tracker submissions, producing unchanged results from last year and the year before, are now languishing toward the back. (Next year will see BeatRoot’s 15th birthday, by the way.) The top of the leaderboard is largely occupied by a set of neural network based methods from Sebastian Böck and Florian Krebs.

This is a more interesting category than it gets credit for, I think: still improving and still with potential. Some MIREX categories have very simplistic test datasets, but this category introduced an intentionally difficult test set in 2012, and it’s notable that the best new submissions are doing far better on it than the older ones. I’m not quite clear on how the evaluation process handles the ambiguity in what the ground truth represents (listeners can legitimately disagree about where the beat falls, and at what metrical level), and I’d love to know what a reasonable upper bound on F-measure might be.
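
For reference, the core F-measure in this task counts an estimated beat as correct if it falls within a small tolerance window (commonly ±70 ms) of an as-yet-unmatched ground-truth beat. A sketch under those assumptions; the official evaluation may differ in its details:

```python
# Sketch of a beat-tracking F-measure with a +/-70 ms tolerance
# window (the commonly used value; the exact MIREX procedure may
# differ, e.g. in how double-counting is handled).
def beat_f_measure(estimated, reference, tol=0.07):
    matched = 0
    used = set()
    for b in estimated:
        for j, r in enumerate(reference):
            if j not in used and abs(b - r) <= tol:
                matched += 1
                used.add(j)
                break
    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```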

Audio Tempo Estimation

Results here. This is another category I missed last year, but we get the same results for the QM Tempo Tracker as we did in 2013. It still does tolerably well considering that its output isn’t well fitted to the evaluation metric, which expects each submission to report the two most salient tempo estimates for the whole piece.
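
As I understand it, the ground truth for each excerpt gives two perceptual tempi plus the relative salience of the slower one, and the submission’s two estimates are scored against them with a tolerance of around 8%. A sketch under those assumptions:

```python
# Sketch of the tempo evaluation as I understand it: the ground
# truth gives two perceptual tempi (gt_t1 slower than gt_t2) and
# the salience of the slower one; each estimate scores if it is
# within 8% of the corresponding ground-truth tempo. The 8%
# tolerance and exact weighting are assumptions here.
def tempo_p_score(est_t1, est_t2, gt_t1, gt_t2, gt_salience):
    tt1 = 1.0 if abs(est_t1 - gt_t1) <= 0.08 * gt_t1 else 0.0
    tt2 = 1.0 if abs(est_t2 - gt_t2) <= 0.08 * gt_t2 else 0.0
    return gt_salience * tt1 + (1.0 - gt_salience) * tt2

# e.g. tempo_p_score(118.0, 240.0, 120.0, 240.0, 0.6) -> 1.0
```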

The top scorer here is a neural network approach (spotting a theme here?) from Sebastian Böck, just as for beat tracking.

Audio Key Detection

Results here and here. The second dataset is new.

The QM Key Detector gets the same results as last year for the dataset that existed then. It scores much worse on the new dataset, which suggests it may be a more realistic test. Again there were no other submissions in this category, which is a pity now that it has a second dataset. Does nobody like key estimation? (I realise it’s a problematic task from a musical point of view, but it does have its applications.)
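
For reference, the weighted score used in this task gives (as far as I recall) full credit for the exact key and partial credit for closely related keys; the exact weights and the direction of the fifth relation are assumptions in this sketch:

```python
# Sketch of the weighted key score, to the best of my recollection:
# exact key 1.0, perfect fifth 0.5, relative major/minor 0.3,
# parallel major/minor 0.2, otherwise 0. A key is (pitch_class,
# mode) with pitch_class 0-11 and mode "major" or "minor"; the
# direction of the fifth relation is an assumption here.
def key_score(est, gt):
    (ep, em), (gp, gm) = est, gt
    if est == gt:
        return 1.0
    if em == gm and (ep - gp) % 12 == 7:   # estimate a fifth above
        return 0.5
    if em != gm:
        relative = (gp + 3) % 12 if gm == "minor" else (gp + 9) % 12
        if ep == relative:
            return 0.3   # relative major/minor
        if ep == gp:
            return 0.2   # parallel major/minor
    return 0.0

# e.g. key_score((7, "major"), (0, "major")) -> 0.5  (G vs C major)
```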

Audio Chord Estimation

Poor results for Chordino because of a bug which I went over at agonising length in my previous post. This problem is now fixed in Chordino v1.1, so hopefully it’ll be back to its earlier standard in 2016!

Some notes

Neural networks

… are widely used this year. Several categories contained at least one submission whose abstract referred to a convolutional or recurrent neural network or to deep learning, and in at least 5 categories I think a neural network method can reasonably be said to have “won” the category. (Yes, I know, MIREX isn’t a competition…)

  • Structural segmentation: convolutional NN performed best
  • Beat tracking: NNs all over the place, definitely performing best
  • Tempo estimation: NN performed best
  • Onset detection: NN performed best
  • Multi-F0: no NNs I think, but it does look as if last year’s “deep learning” submission would have performed better than any of this year’s
  • Chord estimation: NNs present, but not yet quite at the top
  • Key detection: no NNs, indeed no other submissions at all

Categories I missed

  • Audio downbeat estimation: I think I just overlooked this one, for the second year in a row. As last year, I should have submitted the QM Bar & Beat Tracker plugin from Matthew Davies.
  • Real-time audio to score alignment: I nearly submitted the MATCH Vamp Plugin for this, but actually it only produces a reverse path (offline alignment) and not a real-time output, even though it’s a real-time-capable method internally.

Other submissions from the Centre for Digital Music

Emmanouil Benetos submitted a well-performing method, mentioned above, in the Multiple Fundamental Frequency Estimation & Tracking category.

Apart from that, there appear to be none.

This feels like a pity — evaluation is always a pain and it’s nice to get someone else to do some of it.

It’s also a pity because several of the plugins I’m submitting are getting a bit old and are falling to the bottom of the results tables. There are very sound reasons for submitting them (though I may drop some of the worse-performing categories next year, assuming I do this again), but it would be good if they didn’t constitute the only visibility QM researchers have in MIREX.

Why would this be the case? I don’t really know. The answer presumably includes some or all of the following:

  • not working on music informatics signal-processing research at all
  • working on research that builds on feature extractors, rather than on the feature extractors themselves
  • working on research not congruent with MIREX tasks (e.g. looking at dynamics or articulation rather than, say, notes or chords)
  • using similar methods but not on mainstream music recordings (e.g. solo singing, animal sounds)
  • considering the state of the art good enough already
  • lacking the background to compete with current methods (e.g. the wave of NNs), and so sticking with incremental enhancements of existing models
  • lacking the data to compete with current methods
  • not being aware of MIREX
  • not having it prioritised by a supervisor
The last four reasons would be a problem, but the rest might not be. It could really be that MIREX isn’t very relevant to the people in this group at the moment. I’ll have to see what I can find out.