Audio Feature Extraction using FFT, PSD and STFT and Finding The Most Powerful Frequencies

Question

1) Let's assume I have FFT and STFT coefficients obtained using F = fft(x) and S = spectrogram(x). How can these coefficients be used as audio features? (Here "audio feature" is meant in the pattern-recognition sense.)

2) Does the following code give the PSD and the most powerful frequency (in kHz) in the signal?

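Hs = spectrum.periodogram;      % periodogram spectral estimator object
p = psd(Hs, x, 'Fs', 22050);    % PSD estimate of x, sampling rate 22050 Hz
[C, I] = max(p.data);           % C: peak PSD value, I: index of the peak bin
max_f = p.Frequencies(I);       % frequency of the peak bin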

3) If (2) is OK, how can I find the most powerful n frequencies in the signal using the PSD?

4) How can I find the most powerful frequencies using FFT and/or STFT similar to PSD?

Thanks in advance.

See also: http://stackoverflow.com/questions/27546476/what-fft-descriptors-should-be-used-as-feature-to-implement-classification-or-cl/27546643#27546643 — DrKoch, Dec 22 '14 at 07:00

score 2 · Answer 1 · answered Jan 21 '13 at 06:56

1) S = spectrogram(x) gives you the FFT as a function of time: it subdivides the signal x into multiple (usually windowed and overlapping) segments and computes the FFT of each segment, i.e. the short-time Fourier transform. fft(x) gives you the FFT of the entire signal in one go. The former is more likely to track changes in frequency content over time, whereas the latter is more useful for looking at the overall frequency content. I am not too familiar with audio processing, but note that even if two signals have identical power spectra, changes in the complex phase of the FFT can result in dramatically different signals in the time domain.

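2) The syntax seems a bit different from what I am used to in Matlab, but the answer is yes. The units of the frequency depend on the exact syntax that you have used. With 'Fs' given in Hz, as in your code, p.Frequencies should be in Hz, so divide by 1000 to get kHz.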

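3) You can use the sort function to get the n most powerful frequency bins. For example, [B, IX] = sort(p.data, 'descend') and freq_maxn = p.Frequencies(IX(1:n)). Note that sort is ascending by default, so the 'descend' flag is needed to put the strongest bins first.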

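4) PSD = |FFT|^2/N, up to the normalisation convention (a density per Hz carries an extra factor of 1/Fs). In other words, the PSD is simply a scaled version of the squared magnitude of the FFT. However, for real-valued signals only half of the FFT is used, since the other half is simply its complex-conjugate mirror image. Bin k of an N-point FFT (counting from 0) corresponds to the frequency k*Fs/N, so the first N/2+1 bins cover 0 to Fs/2. Once you have that sequence, finding the strongest frequency and the n strongest frequencies works the same way as in (2) and (3). See [periodogram](http://www.mathworks.com/help/signal/ref/periodogram.html) for more information.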
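
As a rough illustration of (3) and (4), here is a minimal sketch that builds a one-sided periodogram directly from fft and reads off the n strongest bins. The three-tone test signal, its length and the variable names are assumptions chosen for the example (the tones fall exactly on FFT bins, so there is no leakage); only Fs = 22050 comes from the question.

```matlab
% Sketch: one-sided periodogram PSD from fft, then the n strongest bins.
Fs = 22050;                         % sampling rate in Hz (from the question)
N  = 4410;                          % 0.2 s of samples, so bins are 5 Hz apart
t  = (0:N-1)' / Fs;                 % time axis in seconds

% Hypothetical test signal: tones at 440, 1000 and 3000 Hz
x = 1.0*sin(2*pi*440*t) + 0.5*sin(2*pi*1000*t) + 0.25*sin(2*pi*3000*t);

X    = fft(x);
half = floor(N/2) + 1;              % keep DC up to Nyquist
P    = abs(X(1:half)).^2 / (N*Fs);  % periodogram scaling (power per Hz)

% Fold the discarded negative-frequency half into the kept half:
% double every bin except DC (and Nyquist when N is even).
if mod(N, 2) == 0
    P(2:end-1) = 2*P(2:end-1);
else
    P(2:end) = 2*P(2:end);
end

f = (0:half-1)' * Fs / N;           % bin k corresponds to k*Fs/N Hz

n = 3;
[~, idx]  = sort(P, 'descend');     % strongest bins first
top_f     = f(idx(1:n));            % n strongest frequencies in Hz
top_f_kHz = top_f / 1000;           % the same in kHz
```

With real audio, spectral leakage spreads each peak over several neighbouring bins, so the n largest bins may all belong to the same peak. Picking local maxima of P first and sorting those is probably closer to what "the n most powerful frequencies" is meant to be.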

1) I wanted to know how these transforms are used as audio features, but your explanation is good to clarify the concepts. 2) I am not sure if the p.Frequencies contains the exact frequency range of the audio file or a scaling is needed. 4) FFT gives an array whose length is equal to the length of the time domain signal. So this apparently is not in the frequency scale. So when plotting, you can define an axis and scale it; but when reading those values, something different (resampling?) is needed. Thanks by the way. — groove, Jan 24 '13 at 14:28

the_mandrill · Answer 2 · 2013-01-24T23:09:19.993

I think you need to define what you mean by 'audio features'. There are many different types of feature depending on what you are trying to achieve (eg see some of the ones featured in these papers).

When you talk about 'most powerful frequency' I assume that you want to do some form of pitch detection? If that is the case, then the peak of the PSD will indeed give the most dominant frequency; however, that isn't necessarily the pitch that you hear. For instance, an instrument may be playing a note at 200Hz, which will have spectral peaks at 200, 400, 600, 800, etc., and it's not necessarily the case that 200Hz will have the highest amplitude. In fact, you could apply a high-pass filter to remove the 200Hz component and you would still perceive that to be the pitch (you hear this effect if you hear music over the phone, where the lowest frequencies are cut off; it's called Virtual Pitch, or the missing fundamental).
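
To make this concrete, here is a small sketch with a synthetic note. The harmonic amplitudes are made up for illustration, with the second harmonic deliberately stronger than the 200Hz fundamental.

```matlab
% Sketch: the strongest spectral peak need not be the fundamental.
Fs = 8000;                          % sampling rate in Hz (assumed)
N  = 8000;                          % 1 s of samples, so bins are 1 Hz apart
t  = (0:N-1)' / Fs;

% Hypothetical note: 200 Hz fundamental with a stronger 2nd harmonic
x = 0.3*sin(2*pi*200*t) + 1.0*sin(2*pi*400*t) ...
  + 0.6*sin(2*pi*600*t) + 0.4*sin(2*pi*800*t);

half = N/2 + 1;                     % bins from DC up to Nyquist
mag  = abs(fft(x));
mag  = mag(1:half);                 % one-sided magnitude spectrum
f    = (0:half-1)' * Fs / N;        % frequency of each bin in Hz

[~, k] = max(mag);
peak_f = f(k);                      % strongest bin: 400 Hz, not 200 Hz
```

The spectral peak lands on 400Hz even though the harmonics are spaced 200Hz apart, which is why a plain maximum over the PSD is not a reliable pitch estimator.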

If you want to detect pitch then I would suggest reading up on Pitch Estimation algorithms.

EDIT: There are quite a few papers out there with research on audio classification, so have a search for work by Eric Scheirer, George Tzanetakis and Martin McKinney among others. I'd also sign up to the MIR mailing list, as many of the core people in this area are on that list and the archives have lots of useful material. As for your question about 'most powerful frequency', I don't quite understand what you mean by it. When listening to music with more than one instrument playing, there is in general no one dominant frequency. There is often a perceptible melody, which by virtue of the mix is often prominent, but I'm not sure if that's what you mean.
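
As one possible reading of "using STFT coefficients as features", the sketch below frames a signal by hand, takes the FFT of each windowed frame and stores two numbers per frame: the peak frequency and the spectral centroid (a commonly used timbre descriptor). The test signal, the frame length of 256, the hop of 128 and the Hann window are all assumptions for the example, not something taken from the papers mentioned above.

```matlab
% Sketch: per-frame features from a hand-rolled STFT (base functions only).
Fs = 8000;                              % sampling rate in Hz (assumed)
t  = (0:Fs-1)' / Fs;                    % 1 s time axis

% Hypothetical signal: 500 Hz in the first half, 1500 Hz in the second
x = [sin(2*pi*500*t(1:4000)); sin(2*pi*1500*t(4001:8000))];

L   = 256;                              % frame length in samples
hop = 128;                              % hop size (50% overlap)
w   = 0.5 - 0.5*cos(2*pi*(0:L-1)'/L);   % periodic Hann window

nFrames = floor((length(x) - L)/hop) + 1;
half    = L/2 + 1;                      % bins from DC up to Nyquist
f       = (0:half-1)' * Fs / L;         % bin frequencies in Hz

peakF    = zeros(nFrames, 1);           % strongest frequency per frame
centroid = zeros(nFrames, 1);           % spectral centroid per frame

for m = 1:nFrames
    seg = x((m-1)*hop + (1:L)') .* w;   % windowed frame m
    mag = abs(fft(seg));
    mag = mag(1:half);                  % one-sided magnitude spectrum
    [~, k]      = max(mag);
    peakF(m)    = f(k);
    centroid(m) = sum(f .* mag) / sum(mag);  % magnitude-weighted mean freq.
end

features = [peakF, centroid];           % one feature row per frame
```

Each row of features could then be fed to a classifier, or summarised over a whole track (for example by its mean and variance), depending on the task.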

I want to do song classification. Artist prediction, genre classification, emotion detection, maybe fingerprinting, or whatever. By features I mean the feature vectors that consist of values such as MFCCs. For the pitch detection part, I understand that "the most powerful frequency" is not always the pitch that we hear. But is the most powerful frequency always the fundamental frequency (200Hz for your example)? So we hear what - the most powerful one, the fundamental, or a harmonic? Thanks by the way. — groove, Jan 24 '13 at 14:15
