MIREX 2017 submissions

For the fifth year in a row, the Centre for Digital Music submitted a number of Vamp audio analysis plugins to the MIREX evaluation for “music information retrieval” tasks. This year we submitted the same set of plugins as last year; there were no new implementations, and some of the existing ones are old enough to have celebrated their tenth birthday earlier in the year. So the goal is not to provide state-of-the-art results, but to give other methods a stable baseline for comparison and to check each year’s evaluation metrics and datasets against neighbouring years. I’ve written about this in each of the four previous years: see posts about 2016, 2015, 2014, and 2013.

Obviously, having submitted exactly the same plugins as last year, we expect basically the same results. But the other entries we’re up against will have changed, so here’s a review of how each category went.

(Note: we dropped one category this year, Audio Downbeat Estimation. Last year’s submission was not well prepared for reasons I touched on in last year’s post, and I didn’t find time to rework it.)

Structural Segmentation

Results for the four datasets are here, here, here, and here. Our results, for Segmentino from Matthias Mauch and the older QM Segmenter from Mark Levy, were the same as last year, with the caveat that the QM Segmenter uses random initialisation and so never gets exactly the same results twice.

Surprisingly, nobody else entered anything to this category this year, which seems a pity because it’s an interesting problem. This category seems to have peaked around 2012-2013.

Multiple Fundamental Frequency Estimation and Tracking

An exciting year for this mind-bogglingly difficult category, with 14 entries from ten different sets of authors and a straight fight between template decomposition methods (including our Silvet plugin, from Emmanouil Benetos’s work) and trendy convolutional neural networks. Results are here and here.

With so many entries and evaluations it’s not that easy to get a clear picture, and no single method appears to be overwhelmingly strong. There were fine results in some evaluations for CNN methods from Thickstun et al. and Thomé and Ahlbäck, for Pogorelyuk and Rowley’s very intriguing “Dynamic Mode Decomposition”, and for a few others whose abstracts are missing from the entry site and so can’t be linked to.

Silvet, with the same results as last year, does well enough to be interesting, but in most cases it isn’t troubling the best of the newer methods.

Audio Onset Detection

Bit of a puzzle here, as our two plugin submissions both got slightly different results from last year despite being unchanged implementations of deterministic methods invoked in the same way on the same datasets.

Last year saw a big expansion in the number of entries, and this year there were nearly as many. Just as last year, our old plugins performed modestly, but again some of the new experiments fared a bit less well, so we weren’t quite at the bottom. Results are here.

Audio Beat Tracking

Same puzzle as in onset detection: while our results were basically similar to last year, they weren’t identical. The 2015 and 2016 results were identical and we would have expected the same again in 2017.

That apart, there’s little to report since last year. Results are here, here, and here.

Audio Tempo Estimation

Last year there were two entries in this category, ours and a much stronger one from Sebastian Böck. This year sees one addition, from Hendrik Schreiber and Meinard Müller, which fares creditably. The results are here.

Audio Key Detection

Two pretty successful new submissions this year, both using convolutional neural networks: one from Korzeniowski, Böck, Krebs and Widmer, and the other from Hendrik Schreiber. Our old plugin (from work by Katy Noland) does not fare tragically, but it’s clear that some other methods are getting much closer to the sort of performance one imagines should be realistic. The results are linked from here.

Intuitively, key estimation seems like the sort of problem that is interesting only so long as you don’t have enough training data. As a 24-way classification with large enough training datasets, it looks a bit mundane. The problem becomes, what does it mean for a piece of music to be in a particular key anyway? Submissions are not expected to answer that, but presumably it sets an upper bound on performance.
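
To make the 24-way framing concrete, here is a minimal sketch of the label space: twelve tonics times two modes. The names used (`TONICS`, `KEY_CLASSES`, `relative_minor` and so on) are made up for illustration and are not taken from any submission or from the MIREX evaluation code. The sharps-only spelling is an arbitrary choice.

```python
from itertools import product

# Twelve pitch classes (sharps-only spelling, an arbitrary choice) and two modes.
TONICS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MODES = ["major", "minor"]

# Every (tonic, mode) pair is one class label: 12 * 2 = 24 classes.
KEY_CLASSES = [f"{tonic} {mode}" for tonic, mode in product(TONICS, MODES)]


def key_to_index(key: str) -> int:
    """Map a label such as 'A minor' to its class index in 0..23."""
    return KEY_CLASSES.index(key)


def index_to_key(index: int) -> str:
    """Map a class index in 0..23 back to its label."""
    return KEY_CLASSES[index]


def relative_minor(major_tonic: str) -> str:
    """Return the minor key sharing a key signature with the given major key.

    The relative minor's tonic lies nine semitones above the major tonic.
    """
    minor_tonic = TONICS[(TONICS.index(major_tonic) + 9) % 12]
    return f"{minor_tonic} minor"
```

The `relative_minor` helper hints at why the ground truth itself can be debatable: C major and A minor draw on the same seven pitch classes, so a piece that drifts between them has no single indisputable label among the 24.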

Audio Chord Estimation

Another increase in the number of test datasets, from 5 to 7, and a strong category again. Last year our submission Chordino (by Matthias Mauch) was beginning to trail, though it wasn’t quite at the back. This year some of the weaker submissions have not been repeated, some new entries have appeared, and Chordino is in last place for every evaluation. It’s not far behind — perceptually it’s still a pretty good algorithm — but some of the other methods are very impressive now. Here are the results.

The abstracts accompanying the two submissions from the audio information processing group at Fudan University in Shanghai (Jiang, Li and Wu; and Wu, Feng and Li) are both well worth a read. The former refers closely to Chordino, using the same NNLS Chroma features with a new front-end. Meanwhile, the latter proposes a method worth remembering for dinner parties, using deep residual networks trained from MIDI-synchronised constant-Q representations of audio, with a bidirectional long short-term memory network and a conditional random field for labelling.