Patrick Flandrin is a physicist and signal-processing researcher whose name I first encountered as co-author (with François Auger) of a 1995 IEEE Transactions on Signal Processing paper called “Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method”.
This crunchy publication (21 pages, dozens of equations and figures) took a pleasing idea — replacing the familiar grid-format time-frequency spectrogram with a field of precisely localised points calculated using both magnitude and phase of the frequency bins, rather than only magnitude as a traditional spectrogram does — and set out the mathematics of applying it to a number of different time-frequency and time-scale representations.
I read this paper about 15 years ago and didn’t understand it. I have since realised this is partly because it isn’t all that clear with its notation, but there is also a big gap between the naive programmer’s view (that’s mine) of a spectrogram and the mathematical analysis used in the paper.
To explain. For a programmer, a spectrogram comes from taking short overlapping slices of a sampled signal, multiplying each by a smoothing window shape, applying a short-time Fourier transform, and taking the magnitudes of the complex output bins to get one column of the spectrogram per slice of input. The short slices are because you want a fixed, smallish number of output bins, and you have various tradeoffs — time and frequency resolution and computational efficiency — to consider in that. The smoothing window is because your Fourier transform — a thing which matches up sinusoids of different frequencies against a signal to identify which ones would add up to it — operates on an infinite signal, consisting of the input you give it repeated forever in both directions: this will have a discontinuity each time it wraps around, and the smoothing window removes some of the frequency artifacts from these discontinuities. There is nothing particularly mathematical about the implementation of this, and any intuition used by the programmer is a mixture of the visual and techniques from the world of engineering. The language used in a publication like the DAFx book is typical in this world.
The Auger & Flandrin paper instead comes from a world that summarises a spectrogram as a two-dimensional Wigner-Ville distribution filtered with a smoothing window leading to a time-frequency representation of the Cohen’s class. Signals are finite-energy functions over infinite domains, and a spectrogram is a double integral over time and angular frequency. Both time-domain functions and time-frequency representations are continuous, and practical questions about overlap and window length don’t arise. I can dimly remember this world, because my undergraduate degree — who am I kidding, my only degree — started out as pure maths, but I haven’t inhabited it for any of my working life.
So I didn’t really understand the paper, and a programmer has plenty to do, and that is one reason why Sonic Visualiser’s “Peak-Frequency Spectrogram” layer calculates instantaneous frequencies from the phase difference between successive columns, something which I found much easier to understand. (It turns out there are other good reasons one could make this choice, but I didn’t know that.1)
Returning to the paper recently, I learned that Flandrin had written a book on the subject, and I bought a copy hoping it might bridge the conceptual gap. It turned out to be a good experience.
* * *
“Explorations in Time-Frequency Analysis” is a monograph digressing on things the author has found interesting in the past 30 years, which — what luck! — happen to be about time-frequency analysis. It’s short, about 200 pages, and nicely printed. There are lots of diagrams, and although equation-heavy it doesn’t hang about proving things, sending you to the references instead. It begins with a glossary of notation (I like it when books do this) and ends with a 9-page bibliography. The writing is crisp and friendly and the scene is set by the first two chapters, a philosophical outline and a chapter of examples with the lovely title “Small Data Are Beautiful”.
Although the book provides a lot of the background to the paper that defeated me, I still spent a potentially embarrassing amount of thought on things I imagine that anyone properly within the target market finds obvious. An example is what it means for a Gaussian function to be “circular” in time and frequency. The book goes over this in far more detail, but briefly a Gaussian — the bell-shaped normal distribution curve found in probability — has the property that its Fourier transform is also a Gaussian. The “wider” the bell shape in the time domain, the “narrower” in the frequency domain: at some point it must be equal in both, and then if you plot it in a spectrogram-like heat map you will see a circle. When does this happen? It’s shown that it happens for the Gaussian corresponding to a normal distribution of variance 1. But at this point I am worrying about units. What does it mean to be circular? The figures illustrating this lack units in either axis — in fact detail-wise many of the figures are more like sketches — and the little bit of engineer in me is wondering: how can you possibly have a circle if you lack units?
The answer I eventually recalled is that the units in one domain define those in the other. In this case, if the time axis is in seconds then angular frequency is radians per second, and a circle is a distribution whose extent in seconds is the same as that in radians per second. Other units such as samples (in time) or STFT bins (in frequency) have similar correspondences in the other domain. This is a place where going back to basics took significant thought, but I did actually appreciate being expected to think about it.
So a nice rehearsal with some interesting bumps, but for me the thrilling twist arrives in chapter 12, “Spectrogram Geometry 2”. This reframes the spectrogram as a complex plane and the reassignment operator in terms of motion in a potential field proportional to the log-spectrogram. This mathematical leap is also an intuitively visual one, and it’s exciting for me because it is a little like how I pictured the spectrogram, with no meaningful mathematical analysis, when developing a certain feature of the Rubber Band timestretcher.2 This chapter is like seeing the vaguely-realised ground beneath your feet resolve into a larger, recognisable object — the moment when you realise you are standing on the back of a giant Pokémon, if you will.
There is a lot more in this book, and I think it will repay repeated visits. I’m not sure whether you could implement anything directly from it, but you could, say, pick a random page and follow up all the references until you really feel you understand it. I think this would be a rewarding exercise that, for someone like me, would probably take around a month per page.
* * *
On that note, one of the first references given is to a book called “Visible Speech” by Potter, Kopp, and Green, 1947. I looked this up and was so intrigued that I tracked down an ex-library copy. It is a lavish presentation, perhaps with both training and PR elements, of a then-new idea called the “sound spectrograph”, i.e. a spectrogram. The title “Visible Speech”, incidentally, is borrowed with attribution from an earlier (1867) work about phonetic alphabets.
The authors of the 1947 book were writing about work done at Bell Labs to try to make the telephone accessible to the deaf. Their experimental devices used paper tape or phosphor display to show spectrographs of the speech sounds, and users were specially trained to interpret speech from them. Here’s a picture from the book of someone using one.
The spectrographs were produced by automatically recording the speech to tape and playing the tape repeatedly through a filter of 300Hz bandwidth, whose centre frequency was incremented linearly between passes in 15Hz steps from 0-3500Hz. (They also had a version using 45Hz bandwidth filters, but it was found to be less legible.) The system was of course analogue.
In this image the top spectrograph is the one with 45Hz bandwidth, which is used to point out some interesting features, but the 300Hz bandwidth spectrograph below it is the form used throughout the rest of the book:
It’s striking how clear these spectrographs are, and it makes a useful reminder that we really aren’t always looking for the most precise representation of something — 300Hz bandwidth at speech frequencies is pretty wide! — but instead the most appropriate in some human dimension.
1 The Sonic Visualiser peak-frequency spectrogram precisely localises stable frequencies, but for each frequency bin it draws a short horizontal line across the whole duration of the bin at the proper frequency rather than localise the bin to a point in time. A very similar output could have been produced using reassignment, because the frequency calculated from phase difference should be very close to that calculated with reassignment. But a decision to do that would have meant ignoring the other reassignment operator, localisation in time, which gives a single point rather than a horizontal line for each bin. Had I understood the reassignment paper, I would probably have felt compelled to do that part properly. For it to work well, a greater bin overlap and much more sophisticated rendering would have been needed, and the result would have been much slower and possibly less clear for real music. I think.
2 This feature, which I gave the vague name “phase lamination”, was worked out in a hurry after discovering that the “phase locking” technique of Jean Laroche and Mark Dolson which I had used in the very first release of Rubber Band was patented. Phase locking reduced audible phasiness with the nice side-effect of making the phase vocoder faster to compute, but it also lent a robotic tang to the sound which certain listeners found even more unpleasant than the phasiness. The scheme I came up with to replace it was based on picturing a gradient field and making adjustments to bins near a peak or trough in proportion to the distance from it — tuned by ear rather than worked out mathematically. Although it lost the improved speed of phase locking, it usually sounds better. The idea seems reasonably obvious, but I hadn’t seen it described anywhere else and I was delighted to find it.