Mp3 decoding with the MAD library: We’ve all been doing it wrong

The MAD mp3 decoder library is widely used in open source applications that play or edit mp3 audio files.

It’s a respected library that consists of high quality C code, has a fairly friendly API, and was evidently written with great care. It’s now getting old (last updated in 2004) but people trust it.

I discovered this week that I’ve been using this library wrong for many years in a couple of small ways. I checked the code of a few other open source applications that use it, and found that all of them (including widely-used programs like Audacity) suffered at least one of the same problems as mine did. We’ve all been doing it wrong.

Here’s what almost every user of this library seems to be doing wrong:

  1. If an mp3 file starts with a Xing/LAME information frame, they are feeding that frame to the mp3 decoder rather than filtering it out, resulting in an unnecessary 1152 samples of silence at the start of the decoded audio. (This is in addition to the variable mp3 encoder delay, and note that the metadata frame is not the same thing as an id3 tag — those are not actually mp3 frames and so don’t have the same problem.)
  2. More importantly, they aren’t providing the decoder an expected but undocumented small block of zero data at the end of the file. Without this, it loses synchronisation on the last mp3 frame, which is consequently never decoded. This causes the decoded audio to be truncated by up to 1152 samples.

Here’s an example audio file you can use to check an application: (audio file link). This file contains two very short bursts of noise, one right at the start of the file and the other at the end, separated by a second and a half or so of silence.

After decoding with MAD, the first burst should start around 0.025 seconds in, and the second should finish just before the end of the decoded audio.

If you load this in an application that uses MAD and find the first burst starts around 0.05 sec, then you have the first of the above problems. If only one of the two bursts is there, or the second is shorter than the first, then you have the second.

My own Sonic Visualiser v2.5 suffers from both problems:

screenshot-from-2016-11-26-19-22-00

But both are fixed in the repository, and will be fixed in the forthcoming release:

screenshot-from-2016-11-26-19-23-28

(If both bursts are there and they appear exactly at the start and end of the file without any padding silence at all, then your decoder not only handles these details correctly but also interprets the LAME information frame and accounts for the encoder delay and padding listed in there. Sonic Visualiser doesn’t do that even after this fix, but that could change!)

I’ve also started feeding some fixes to a few other projects (e.g. this pull request for the more serious of those problems in Audacity).

The root of the problem I think is that MAD is an mp3 stream decoder and not an mp3 file decoder. These two things are almost the same, as an mp3 file is just a sequence of stream frames with no file header: if you concatenate two mp3 files you get a valid mp3 file containing the concatenation of the two audio streams. But the fact that MAD doesn’t deal with files means that it doesn’t know when a file has ended, and it doesn’t know about file metadata frames, and these turn out to be things you have to handle in the calling code.

Users of the library maybe don’t realise this because the documentation is quite limited. Developers are pointed to an example program (called minimad) which itself fails to deal with either of these things. There is an official program called madplay that handles both of them properly and could serve as an example, but people don’t seem to be all that conscious of it — it isn’t widely packaged for Linux distributions for example, and until this week I had never looked at its source code.

There ought to be lessons here for both library users and library authors, but I’m not completely sure what those lessons are.

Library users should be testing their import code by comparison with expected decoded data, but I was actually already doing that and I still missed both problems. (I allowed for the mp3 encoder delay by accommodating any amount of leading silence in my tests, so I missed that there was more than there should be; and I foolishly checked whether the decoded data matched the expected data throughout its extent rather than the other way around, so missing that it had been truncated.)

This is probably also a case for using higher-level libraries like CoreAudio (or gstreamer, except that I think gstreamer also gets this wrong in its MAD plugin). Using format-specific open source libraries gives you consistent portability across platforms from a single codebase, but that doesn’t help much if you are deceived by the differences between different format libraries and end up not using them correctly.

For library authors the lesson really seems to be that people will copy the code you give them expecting it to be a complete example for the most obvious use case. If the two don’t match, there’ll be trouble.

I’d be interested to hear about any examples of open source software that get the MAD decoder right.

Chordino troubles

On September the 9th, I released a v1.0 build of the Chordino and NNLS Chroma Vamp plugin. This plugin analyses audio recordings of music and calculates some harmonic features, including an estimated chord transcription. When used with Sonic Visualiser, Chordino is potentially very useful for anyone who likes to play along with songs, as well as for research.

Chordino was written by Matthias Mauch, based on his own research. Although I made this 1.0 release, my work on it only really extended as far as fixing some bugs found in earlier releases using the Vamp Plugin Tester.

Unfortunately, with one of those fixes, I broke the plugin. The supposedly more reliable 1.0 update was substantially less accurate at identifying both chord-change boundaries and the chords themselves than any previous version.

I didn’t notice. Nor did Matthias, who had recently left our research group and was busy starting at a new job. One colleague sent me an email saying he had problems with the new release, but I jumped to the completely wrong conclusion that this had something to do with parameter settings having changed since the last release, and suggested he raise it with Matthias.

I only realised what had happened after we submitted the plugin to MIREX evaluation, something we do routinely every year for plugins published by C4DM, when the MIREX task captain Johan Pauwels emailed to ask whether I had expected its scores to drop by 15% from the previous year’s submission (see results pages for 2014, 2015). By that time, the broken plugin had been available for over a month.

This is obviously hugely embarrassing—perhaps the most unambiguous screwup of my whole programming career so far. As the supposed professional software developer of my research group, I took someone else’s working code, broke it, published and promoted the broken version with his name all over it, and then submitted it to a public evaluation, again with his name on it, where its brokenness was finally made pointed out to me, a month later, by someone else. Any regression test, even on only a single audio file, would have shown up the problem immediately. Regression-testing this sort of software can be tricky, but the simplest possible test would have worked here. And a particularly nice irony is provided by the fact that I’ve just come from a four-year project whose aims included trying to improve the way software is tested in academia.

I’ve now published a fixed version of the plugin (v1.1), available for download here. This one has been regression tested against known-good output, and the tests are in the repository for future use. The broken version is actually gone from the download page (though of course it is still tagged in the source repository), to avoid anyone getting the wrong one by accident.

I’m also working on a way to make simple regression tests easier to provide and run, for the other plugins I work on.

That’s all for the “public service announcement” bit of this post; read on only if you’re interested in the details.

What was the change that broke it? Well, it was a change I made after running the plugin through the Vamp Plugin Tester, a sort of automated fuzz-testing tool that helps you find problems with your code. (Again, there’s an irony here. Using this tool is undoubtedly a good practice, as it can show up all sorts of problems that might not be apparent to developers otherwise. Even so, I should have known well how common it is to introduce bugs while fixing things like compiler warnings and static analysis tests.)

The problem I was trying to fix here was that intermediate floating-point divisions sometimes overflowed, resulting in infinity values in the output. This only happened for unusual inputs, so it appeared reasonable to fix it by clamping intermediate values when they appeared to be blowing up out of the expected range. But I set the threshold too low, so that many intermediate values from legitimate inputs were also being mangled. I then also made a stupid typo that made the results a bit worse still (you can see the change in question around line 500 of the file in this diff).

Note that this only broke the output from the Chordino chord estimator, not the other features calculated by NNLS Chroma.

A digression. An ongoing topic of debate in the world of the Research Software Engineer is whether software development resources in academia should be localised within research groups, or centralised.

The localised approach, which my research group has taken with my own position, employs developers directly within a research subject. The centralised approach, typified by the Research Software Development group at UCL, proposes a group of software developers who are loaned or hired out to research groups according to need and availability.

In theory, the localised approach can be simpler to manage and should increase the likelihood of developers being available to help with small pieces of work requiring subject knowledge at short notice. The centralised approach has the advantage that all developers can share the non-subject-specific parts of their workload and knowhow.

I believe that in general a localised approach is useful, and I suspect it is easier to hire developers for a specific research group than to find developers good enough to be able to parachute in to anywhere from a central team.

In a case like this, though, the localised approach makes for quite a lonely situation.

Companies that produce large software products that work do so not because they employ amazing developers but because they have systems in place to support them: code review, unit testing, regression tests, continuous integration, user acceptance tests.

But for me as a lone professional developer in a research group, it’s essentially my responsibility to provide those safety nets as well as to use them. I had some of them in place for most of the code I work on, but there was a big hole for this particular project. I broke the code, and I didn’t notice because I didn’t have the right tests ready. Neither did the researcher who wrote most of this code, but that wasn’t his job. When some software goes out from this group that I have worked on, it’s my responsibility to make sure that the code aspects of it (as opposed to the underlying methods) work correctly. Part of my job has to be to assume that nobody else will be in a position to help.

 

SoundSoftware tutorial at AES 53

I’ll be co-presenting the first tutorial session at the Audio Engineering Society 53rd Conference on Semantic Audio, this weekend.

(It’s the society’s 53rd Conference, and it happens to be about semantic audio. It’s not their 53rd conference about semantic audio. In fact it’s their second: that was also the theme of the AES 42nd Conference in 2011.

What is semantic audio? Good question, glad you asked. I believe it refers to extraction or estimation of any semantic material from audio, including speech recognition and music information retrieval.)

My tutorial, for the SoundSoftware project, is about developing better and more reliable software during research work. That’s a very deep subject, so at best we’ll barely hint at a few techniques during one programming exercise:

  • making readable experimental code using the IPython Notebook, and sharing code for review with colleagues and supervisors;
  • using version control software to manage revisions and escape disaster;
  • modularising and testing any code that can be used in more than one experiment;
  • packaging, publishing, and licensing code;
  • and the motivations for doing the above.

We presented a session at the Digital Audio Effects (DAFx) conference in 2012 which covered much of this material in presentation style, and a tutorial at the International Society for Music Information Retrieval (ISMIR) in 2012 which featured a “live” example of test-driven development in research software. You can find videos and slides from those tutorials here. The theme of this one is similar, and I’ll be reusing some code from the ISMIR tutorial, but I hope we can make this one a bit more hands-on.

 

How Much Legacy Code Have You Written This Week?

I recently bought a copy (based on a recommendation) of Michael Feathers’ 2005 book Working Effectively with Legacy Code.

This excellent technical book is largely a compendium of refactoring strategies to help software developers insinuate unit tests into existing code.

What I found most striking, though, is a position stated right at the start of the book.

Most programmers have a broadly intuitive idea what “legacy code” means. It’s typically a euphemism for code that they resent, because it’s old and crufty and they have to maintain it unwillingly. There may be a hint of the positive meaning of “legacy”—code handed down to you that does at least work, which is why you have to maintain it. In my mind there’s also a suggestion that the code is alien in some way—written for an obsolete platform or in an obsolete language—which contributes to the difficulty of managing it.

Feathers dismisses most of these possible definitions, writing

In the industry, legacy code is often used as a slang term for difficult-to-change code that we don’t understand. But over years of working with teams, helping them get past serious code problems, I’ve arrived at a different definition.

To me, legacy code is simply code without tests.

[…] With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.

That’s pretty uncompromising, but it certainly gives you perspective.

If you take this point of view, then every line of code you write that lacks tests—proper tests that properly test, and that other people can run—is an immediate, direct contribution to the sludgy mountain of cruddy code that nobody really understands and that probably doesn’t work.

(Of course, code with tests can still be useless. But at least it has potential.)

I’ve written a lot of legacy code over the past few years. I’ve only quite recently started unit testing routinely in new code I write, and I still need to crisp up my development practice—fewer lines of code per task, more of them demonstrably working. I might not find Feathers’ definition of legacy code entirely satisfactory as a cultural one, but it’s a highly useful stimulus.

SoundSoftware 2012 Workshop

Yesterday the SoundSoftware project, which I help to run, hosted the SoundSoftware 2012 Workshop at Queen Mary. This was a one-day workshop about working practices for researchers developing software and experiences they have had in software work, with an eye to subjects of interest to audio and music researchers.

You can read about the workshop at the earlier link; I’d just like to mention two talks that I found particularly interesting. These were the talk from Paul Walmsley followed by that of David Gavaghan.

Paul is a long-serving senior developer in the Sibelius team at Avid (a group I’m interested in already because of my former life working on the notation editor aspects of Rosegarden: Sibelius was always the gold standard for interactive notation editing). He’s an articulate advocate of unit testing and of what might be described as a decomposition of research work in such a way as to be able to treat every “research output” (for example, presenting a paper) as an experiment demanding reproducibility and traceable provenance.

Usefully, he was able to present ideas like these as simplifying concerns, rather than as arduous reporting requirements. At one point he remarked that he could have shaved six months off his PhD if he had known about unit testing at the time—a remark that proved a popular soundbite when we quoted it through the SoundSoftware tweeter.

(I have an “if only I’d been doing this earlier” feeling about this as well: Rosegarden now contains hundreds of thousands of lines of essentially untested code, much of which is very fragile. Paul noted that much of the Sibelius code also predates this style of working, but that they have been making progress in building up test cases from the low-level works upward.)

David Gavaghan took this theme and expanded on it, with the presentation of his biomedical project Chaste (for “cancer, heart, and soft tissue environment”). This remarkable project, from well outside the usual field we usually hear about in the Centre for Digital Music, was built from scratch in a test-driven development process—which David refers to as “testfirst”. It started with a very short intensive project for a number of students, which so exercised the people involved that they voluntarily continued work on it up to its present form: half a million lines of code with almost 100% test coverage that has proven to avoid many of the pitfalls found in other biomedical simulation software.

Small conclusions about APIs and testing

In my previous post I explained a small but significant API change for v0.9 of the Dataquay library.

Although there was nothing very deep about this change or its causes, I found it interesting partly because I had used a partly test-driven process to evolve the original API and I felt there may be a connection between the process and any resulting problems. Here are a few thoughts prompted by this change.

Passing the tests is not enough

Test-driven development is a satisfying and welcome prop. It allows you to reframe difficult questions of algorithm design in terms of easier questions about what an algorithm should produce.

But producing the right results in every test case you can think of is not enough. It’s possible to exercise almost the whole of your implementation in terms of static coverage, yet still have the wrong API.

In other words, it may be just as easy to overfit the API to the test cases as it is to overfit the test cases to the implementation.

Unit testing may be easier than API design

So, designing a good API is harder than writing tests for it. But to rephrase that more encouragingly: writing tests is easier than designing the API.

If, like me, you’re used to thinking of unit testing as requiring more effort than “just bunging together an API”, this should be a worthwhile corrective in both directions.

API design is harder than you think, but unit testing is easier. Having unit tests doesn’t make it any harder to change the API, either: maintaining tests during redesign is seldom difficult, and having tests helps to ensure the logic doesn’t get broken.

Types are not just annoying artifacts of the programming language

An unfortunate consequence of having worked with data representation systems like RDF mostly in the context of Web backends and scripting languages is that it leads to a tendency to treat everything as “just a string”.

This is fine if your string has enough syntax to be able to distinguish types properly by parsing it—for example, if you represent RDF using Turtle and query it using SPARQL.

But if you break down your data model into individual node components while continuing to represent those as untyped strings, you’re going to be in trouble. You can’t get away without understanding, and somewhere making explicit, the underlying type model.

Predictability contributes to simplicity

A simpler API is not necessarily one that leads to fewer or shorter lines of code. It’s one that leads to less confusion and more certainty, and carrying around type information helps, just as precondition testing and fail-fast principles can.

It’s probably still wrong

I’ve effectively found and fixed a bug, one that happened to be in the API rather than the implementation. But there are probably still many remaining. I need a broader population of software using the library before I can be really confident that the API works.

Of course it’s not unusual to see significant API holes in 1.0 releases of a library, and to get them tightened up for 2.0. It’s not the end of the world. But it ought to be easier and cheaper to fix these things earlier rather than later.

Now, I wonder what else is wrong…