Mp3 decoding with the MAD library: We’ve all been doing it wrong

The MAD mp3 decoder library is widely used in open source applications that play or edit mp3 audio files.

It’s a respected library that consists of high quality C code, has a fairly friendly API, and was evidently written with great care. It’s now getting old (last updated in 2004) but people trust it.

I discovered this week that I’ve been using this library wrong for many years in a couple of small ways. I checked the code of a few other open source applications that use it, and found that all of them (including widely-used programs like Audacity) suffered at least one of the same problems as mine did. We’ve all been doing it wrong.

Here’s what almost every user of this library seems to be doing wrong:

  1. If an mp3 file starts with a Xing/LAME information frame, they are feeding that frame to the mp3 decoder rather than filtering it out, resulting in an unnecessary 1152 samples of silence at the start of the decoded audio. (This is in addition to the variable mp3 encoder delay, and note that the metadata frame is not the same thing as an id3 tag — those are not actually mp3 frames and so don’t have the same problem.)
  2. More importantly, they aren’t providing the decoder an expected but undocumented small block of zero data at the end of the file. Without this, it loses synchronisation on the last mp3 frame, which is consequently never decoded. This causes the decoded audio to be truncated by up to 1152 samples.

Here’s an example audio file you can use to check an application: (audio file link). This file contains two very short bursts of noise, one right at the start of the file and the other at the end, separated by a second and a half or so of silence.

After decoding with MAD, the first burst should start around 0.025 seconds in, and the second should finish just before the end of the decoded audio.

If you load this in an application that uses MAD and find the first burst starts around 0.05 sec, then you have the first of the above problems. If only one of the two bursts is there, or the second is shorter than the first, then you have the second.

My own Sonic Visualiser v2.5 suffers from both problems:

screenshot-from-2016-11-26-19-22-00

But both are fixed in the repository, and will be fixed in the forthcoming release:

screenshot-from-2016-11-26-19-23-28

(If both bursts are there and they appear exactly at the start and end of the file without any padding silence at all, then your decoder not only handles these details correctly but also interprets the LAME information frame and accounts for the encoder delay and padding listed in there. Sonic Visualiser doesn’t do that even after this fix, but that could change!)

I’ve also started feeding some fixes to a few other projects (e.g. this pull request for the more serious of those problems in Audacity).

The root of the problem I think is that MAD is an mp3 stream decoder and not an mp3 file decoder. These two things are almost the same, as an mp3 file is just a sequence of stream frames with no file header: if you concatenate two mp3 files you get a valid mp3 file containing the concatenation of the two audio streams. But the fact that MAD doesn’t deal with files means that it doesn’t know when a file has ended, and it doesn’t know about file metadata frames, and these turn out to be things you have to handle in the calling code.

Users of the library maybe don’t realise this because the documentation is quite limited. Developers are pointed to an example program (called minimad) which itself fails to deal with either of these things. There is an official program called madplay that handles both of them properly and could serve as an example, but people don’t seem to be all that conscious of it — it isn’t widely packaged for Linux distributions for example, and until this week I had never looked at its source code.

There ought to be lessons here for both library users and library authors, but I’m not completely sure what those lessons are.

Library users should be testing their import code by comparison with expected decoded data, but I was actually already doing that and I still missed both problems. (I allowed for the mp3 encoder delay by accommodating any amount of leading silence in my tests, so I missed that there was more than there should be; and I foolishly checked whether the decoded data matched the expected data throughout its extent rather than the other way around, so missing that it had been truncated.)

This is probably also a case for using higher-level libraries like CoreAudio (or gstreamer, except that I think gstreamer also gets this wrong in its MAD plugin). Using format-specific open source libraries gives you consistent portability across platforms from a single codebase, but that doesn’t help much if you are deceived by the differences between different format libraries and end up not using them correctly.

For library authors the lesson really seems to be that people will copy the code you give them expecting it to be a complete example for the most obvious use case. If the two don’t match, there’ll be trouble.

I’d be interested to hear about any examples of open source software that get the MAD decoder right.