MIREX 2014 submissions

Last year, Luís Figueira and I experimentally submitted a batch of audio analysis methods, implemented in Vamp plugins developed over the past few years at the C4DM, to the Music Information Retrieval Evaluation Exchange (MIREX). I found the process interesting and wrote an article about the results.

I wasn’t sure whether to do a repeat submission this year—most of the plugins would be the same—but Simon Dixon persuaded me. The test datasets might change; it might be interesting to see whether results are consistent from one year to the next; and it’s always good to provide one more baseline for other submissions to compare themselves against. So I dusted off last year’s submission scripts, added the new Silvet note transcription plugin, and submitted them.

Here are the outcomes. There is also an overview poster published by MIREX. See last year’s article for more detail on what each task consists of.

Multiple Fundamental Frequency Estimation and Tracking

The only category we didn’t submit to last year. This is the problem of deducing which notes are being played, and when, in music where more than one note sounds at once. I submitted the Silvet plugin, which is based on a method by Emmanouil Benetos that had performed well in an earlier year of MIREX.

The results for this category are divided into two parts, multiple fundamental frequency estimation and note tracking. I submitted a script only for the note tracking part. I would describe the performance of our plugin as “correct”, in that it was reliably mid-pack across the board, pretty good for piano transcription, and generally marginally better than the MIREX 2012 submission which inspired it.
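To give a sense of how note tracking is scored, here is a minimal sketch of matching estimated notes against a reference by onset time and pitch. The greedy matching and the tolerance values are my assumptions for illustration, not necessarily the exact MIREX procedure (which also has a variant that takes note offsets into account).

```python
def note_tracking_scores(ref, est, onset_tol=0.05, pitch_tol=0.5):
    """Greedily match estimated notes to reference notes.

    ref, est: lists of (onset_seconds, midi_pitch) tuples.
    A note counts as a hit if its onset is within onset_tol seconds
    and its pitch within pitch_tol semitones of an unmatched
    reference note. Returns (precision, recall, f_measure).
    """
    unmatched = list(ref)
    hits = 0
    for onset, pitch in est:
        for cand in unmatched:
            if abs(cand[0] - onset) <= onset_tol and abs(cand[1] - pitch) <= pitch_tol:
                unmatched.remove(cand)
                hits += 1
                break
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

With a reference of three notes and an estimate that gets two onsets close enough, this yields precision and recall of 2/3 each.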

This was a fairly popular category this year, and one submission in particular improved quite substantially on previous years’ results. It may be no coincidence that that submission’s abstract employs the phrase of the moment, “deep learning”.

Audio Onset Detection

The same two submissions as last year (OnsetsDS and QM Onset Detector) and exactly the same results—the test dataset is unchanged and the plugins are entirely deterministic. Last year I remarked that our methods are quite old and other submissions should improve on them over time, but this year’s top methods were actually no improvement on last year’s.

Audio Beat Tracking

Again the same two submissions as last year (BeatRoot and QM Tempo Tracker) and exactly the same results (1, 2, 3), behind the front-runners but still reasonably competitive. While the best-performing methods continue to advance, it’s clear that beat tracking is still not a solved problem.

Audio Key Detection

Last year we entered a plugin that wasn’t expected to do very well here, and it swept the field. This year everyone else seems to have dropped out, so our repeat submission was in fact the only entry! (It got the same results as last year.)

Audio Chord Estimation

This is interesting partly because our submission (Chordino) performed very well last year but the evaluation metric has since changed.

Sadly, there were only three submissions this year. Chordino still looks good in all three datasets (1, 2, 3) but it is now ranked second rather than first for all three. I’m a bit disappointed that the new leading submission seems to be lacking a descriptive abstract.

Categories we could have entered but didn’t

Audio Melody Extraction

Last year’s submission wasn’t really good enough to repeat.

Audio Downbeat Estimation

I overlooked this task, which was new this year. Otherwise I could have submitted the QM Bar and Beat Tracker plugin.

Audio Tempo Estimation, Structural Segmentation

These categories had an earlier submission deadline than the rest, and stupidly I missed it.

QM Vamp Plugins in MIREX

During the past 7 years or so, we in the Centre for Digital Music have published quite a few audio analysis methods in the form of Vamp plugins: bits of software that you can download and use yourself with Sonic Visualiser, run on a set of audio recordings with Sonic Annotator, or use with your own applications.

Some of these methods were, and remain, pretty good. Some are reasonably good, simplified versions of work that was state-of-the-art at the time, but might not be any more. Some have always been less impressive. They are all available free, with source code—or with commercial licences for companies that want to incorporate them into their products.

This year we thought we should give them a trial against the current state of the art in academia. Luís Figueira and I prepared a number of entries for the annual Music Information Retrieval Evaluation Exchange (or MIREX), submitting a Vamp plugin from our group in every category where we had one available.

MIREX, which is an excellent large-scale community endeavour organised by J Stephen Downie at UIUC, works by running your methods across a known test dataset of music recordings, comparing the results against “ground truth” produced in advance by humans, and publishing scores showing how well each method performs.

Here’s how we got on for each evaluation task.

Audio Onset Detection

(That is, identifying the times in the music recording where each of the individual notes begins.)

We submitted two plugins here: the QM Onset Detector plugin implementing a number of (by now) standard methods, from Juan Bello and others back in 2005; and OnsetsDS, a refinement by Dan Stowell aimed at real-time use (so not directly relevant to this task). Both did modestly well. These methods have been published for a long time and have become widely known, so it would be a disappointment if current work didn’t improve on them.
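Onset detection is typically scored with an F-measure under a small tolerance window: an estimated onset counts as a hit if it falls close enough to an unmatched reference onset. Here is a minimal sketch; the 50 ms tolerance is my assumption of a typical value, not necessarily the exact figure MIREX uses.

```python
def onset_f_measure(ref, est, tol=0.05):
    """F-measure for onset detection.

    ref, est: lists of onset times in seconds. An estimated onset is a
    hit if it lies within tol seconds of a not-yet-matched reference
    onset, scanning both lists in time order.
    """
    ref = sorted(ref)
    est = sorted(est)
    hits = 0
    i = 0
    for t in est:
        # skip reference onsets that are already too far in the past
        while i < len(ref) and ref[i] < t - tol:
            i += 1
        if i < len(ref) and abs(ref[i] - t) <= tol:
            hits += 1
            i += 1
    p = hits / len(est) if est else 0.0
    r = hits / len(ref) if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```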

Audio Beat Tracking

(Tapping along with the beat.)

Here we entered the QM Tempo Tracker plugin, based on the work of Matthew Davies, and a Vamp plugin implementation of Simon Dixon’s BeatRoot beat tracker. Both of these are now quite old methods (especially BeatRoot, although the plugin is new). The results for three datasets are here, here and here.

Both the original BeatRoot and a different version of Matthew Davies’ work were included in the MIREX evaluation back in 2006, and the ’06 dataset is one of the three used this year. So you can compare the 2006 versions here and the 2013 evaluations over here. They perform quite similarly, which is a relief. You can also see that the state of the art has moved on a bit.

Audio Tempo Estimation

(Coming up with an overall estimate in beats-per-minute of the tempo of a recording. Presumably the evaluation uses clips in which the tempo doesn’t vary.)

We entered the same QM Tempo Tracker plugin, from Matthew Davies, as used in the Beat Tracking evaluation. It doesn’t quite suit the evaluation metric, because the plugin estimates tempo changes rather than the two fixed tempo estimates (higher and lower, to allow for beat-period “octave” errors) the task calls for—but it performed pretty well. Again, a related method was evaluated on the same dataset in MIREX ’06 with quite similar results.
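The “octave” error mentioned above is when a tracker locks on to double or half the true tempo. A minimal sketch of classifying a single tempo estimate accordingly, with an 8% relative tolerance that is my guess at a typical value rather than the task’s actual figure:

```python
def tempo_agreement(estimate_bpm, reference_bpm, tolerance=0.08):
    """Classify a tempo estimate against a reference tempo.

    Returns 'exact' if within the relative tolerance of the reference,
    'octave' if within tolerance of double or half the reference
    (a beat-period octave error), and 'miss' otherwise.
    """
    def within(value, target):
        return abs(value - target) <= tolerance * target

    if within(estimate_bpm, reference_bpm):
        return 'exact'
    if within(estimate_bpm, 2 * reference_bpm) or within(estimate_bpm, reference_bpm / 2):
        return 'octave'
    return 'miss'
```

The actual task asks for two fixed estimates (higher and lower) plus a weighting, precisely so that octave ambiguity can be scored gracefully; this sketch only illustrates the ambiguity itself.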

Audio Key Detection

(Estimating the overall key of the piece, insofar as that makes sense.)

We entered the QM Key Detector plugin for this task. This plugin, from Katy Noland back in 2007, is straightforward and fast, and is intended to detect key changes rather than the overall key.

To everyone’s surprise (including Katy’s) it scored better than any other entry, and indeed better than any entry from the past four years! The test dataset is pretty simplistic, but this is a nice result anyway.
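For context, key detection is usually scored with partial credit for musically close errors. The weights below (1.0 exact, 0.5 for a fifth, 0.3 for relative major/minor, 0.2 for parallel major/minor) are as I recall the MIREX scheme, and the fifth relation is allowed in either direction here, so treat this as a sketch rather than the official definition.

```python
def key_score(est, ref):
    """Weighted key-detection score with partial credit.

    A key is a (pitch_class, mode) pair: pitch_class 0-11 with C = 0,
    mode 'major' or 'minor'.
    """
    (ep, em), (rp, rm) = est, ref
    if est == ref:
        return 1.0
    # perfect-fifth relation, same mode (allowed in either direction here)
    if em == rm and (ep - rp) % 12 in (5, 7):
        return 0.5
    if em != rm:
        # relative major/minor, e.g. A minor vs C major
        minor_pc = ep if em == 'minor' else rp
        major_pc = rp if em == 'minor' else ep
        if (minor_pc - major_pc) % 12 == 9:
            return 0.3
        # parallel major/minor: same tonic, different mode
        if ep == rp:
            return 0.2
    return 0.0
```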

Audio Melody Extraction

(Writing down the notes for the main melody from a recording which may have more than one instrument.)

Here we submitted my own cepstral pitch tracker plugin. This is not actually a melody extractor at all, but a monophonic pitch tracker with note estimation intended for solo singing. And it was developed as an exercise in test-driven development, rather than as a research outcome. It was not expected to do well. It actually did come out well in one dataset (solo vocal?), but it got weak results in the other three. I’m quite excited about having submitted something all-my-own-work to MIREX though.

Audio Chord Estimation

(Annotating the chord changes in a piece based on the recording.)

For this task we entered the Chordino plugin from Matthias Mauch. This plugin is much the same as the “Simple Chord Estimate” method that Matthias entered for MIREX in 2010; it got the same excellent results then and now for the dataset that was used in both years, and it also got the highest scores in the other dataset.
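Chord estimation is commonly scored by overlap: the fraction of the recording’s duration on which the estimated chord label agrees with the reference. Here is a minimal sketch of that idea; the real evaluation also involves mapping chord labels into a common vocabulary, which I’m ignoring here.

```python
def chord_overlap(ref_segments, est_segments):
    """Fraction of the timeline on which the estimated chord label
    agrees with the reference.

    Each argument is a list of (start, end, label) tuples, assumed to
    cover the same overall timespan without internal overlaps.
    """
    total = 0.0
    correct = 0.0
    for rs, re_, rl in ref_segments:
        total += re_ - rs
        for es, ee, el in est_segments:
            overlap = min(re_, ee) - max(rs, es)
            if overlap > 0 and el == rl:
                correct += overlap
    return correct / total if total else 0.0
```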

Structural Segmentation

(Dividing a song up into parts based on musical structure. The parts might correspond to verse, chorus, bridge, etc—though the segmenter is not required to label them, only to identify which ones have similar structural purpose.)

Two entries here. The Segmentino plugin from Matthias Mauch is fairly new, and is the only submission we made for which plugin code has not yet been released—we hope to remedy that soon. And we also entered Mark Levy’s QM Segmenter plugin, an older and more lightweight method.

The results for different test datasets are here, here, here and here. The evaluation metrics are slightly baffling (for me anyway). I have been advised to concentrate on:

  • Frame pair clustering F-measure: how well corresponding sections correspond; this measures getting matching segment types right. Segmentino does very well here, except in one dataset for some reason. The QM Segmenter is not as good, but actually not so bad either.
  • Segment boundary recovery evaluation measures: how accurately the segmenters report the precise locations of segment boundaries. Neither of our submissions does this very well, although Segmentino does well on precision at 3 seconds, meaning the segment boundaries it does report are usually fairly close to the real ones.
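My reading of the frame pair clustering F-measure is: label every analysis frame with its segment type, consider all pairs of frames, and score agreement on whether each pair shares a label in the estimate versus the reference. A brute-force sketch of that reading (the real metric may differ in details such as frame rate):

```python
from itertools import combinations

def pairwise_f_measure(ref_labels, est_labels):
    """Frame pair clustering F-measure.

    ref_labels, est_labels: equal-length sequences giving the segment
    type of each frame. A pair of frames is 'similar' if both frames
    carry the same label; precision and recall compare the sets of
    similar pairs in the estimate and the reference.
    """
    assert len(ref_labels) == len(est_labels)
    ref_pairs = set()
    est_pairs = set()
    for i, j in combinations(range(len(ref_labels)), 2):
        if ref_labels[i] == ref_labels[j]:
            ref_pairs.add((i, j))
        if est_labels[i] == est_labels[j]:
            est_pairs.add((i, j))
    both = len(ref_pairs & est_pairs)
    p = both / len(est_pairs) if est_pairs else 0.0
    r = both / len(ref_pairs) if ref_pairs else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```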

This is a pretty good result—I think!