
When I extract MFCCs from an audio file, the output has shape (13, 22). What do these numbers represent? Is the second one the number of time frames? I use librosa.

The code I use is:

mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs


print(mfccs.shape)

And the output is (13, 22).

Hendrik
ioan_bl

1 Answer


Yes, the second dimension is the number of time frames. It depends mainly on how many samples you provide via y and which hop_length you choose.

Example

Say you have 10 s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like an STFT, melspectrogram, or MFCC, so-called feature frames are computed.

The question is, how many (feature) frames do you get for your 10s of audio?

The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1-D audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment, and moves on to the next segment. These segments usually overlap. The distance between the starts of two consecutive segments is hop_length, and it is specified in number of samples. It may be identical to n_fft, but often hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
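To make the sliding window concrete, here is a minimal pure-Python sketch. It is not how librosa implements framing; `frame_starts` is a hypothetical helper, the 16-sample "signal" is a toy stand-in, and no padding is applied (librosa pads by default, so it would return more frames than this).

```python
def frame_starts(num_samples, n_fft, hop_length):
    """Return the start index of every full-length window that fits in the signal."""
    return list(range(0, num_samples - n_fft + 1, hop_length))


signal = list(range(16))  # toy "audio": 16 samples, values equal to their index
n_fft = 8                 # window length in samples
hop_length = 4            # distance between window starts, in samples

starts = frame_starts(len(signal), n_fft, hop_length)
frames = [signal[s:s + n_fft] for s in starts]

for s, frame in zip(starts, frames):
    print(s, frame)
# 0 [0, 1, 2, 3, 4, 5, 6, 7]
# 4 [4, 5, 6, 7, 8, 9, 10, 11]
# 8 [8, 9, 10, 11, 12, 13, 14, 15]
```

With hop_length equal to half of n_fft, consecutive frames share half of their samples. A smaller hop_length yields more frames for the same signal, i.e. finer temporal resolution.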

10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:

number_of_samples / hop_length = number_of_frames

So for our example, this would be:

220500 / 256 = 861.3

So we get about 861 frames.
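The estimate above can be checked with a few lines of Python. The helper names are hypothetical. The second function encodes an assumption about librosa's default behavior (frames centered on multiples of hop_length, with the signal padded at both ends), which, as far as I know, gives one frame more than the plain division suggests.

```python
def estimate_num_frames(num_samples, hop_length):
    """Rough estimate from the answer: samples divided by hop length."""
    return num_samples / hop_length


def num_frames_centered(num_samples, hop_length):
    """Frame count assuming centered frames with padding at both ends.

    Assumption: this mirrors librosa's default (center=True).
    """
    return 1 + num_samples // hop_length


sample_rate = 22050                # librosa's default rate after loading
num_samples = 10 * sample_rate     # 10 s of audio -> 220500 samples

print(estimate_num_frames(num_samples, 256))  # 861.328125
print(num_frames_centered(num_samples, 256))  # 862
```

Under the same assumption, the 22 frames from the question correspond to an input of roughly 5,400 to 5,600 samples, i.e. a clip of about a quarter of a second at 22,050 Hz.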

Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:

frame_rate = sample_rate / hop_length = 86.13

To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).

frames = frame_rate * audio_in_seconds
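As a small sketch of the frame-rate shortcut, the snippet below computes the frame rate and uses it in both directions: from duration to frames, and from a frame count back to a duration. The function name is hypothetical, and the 22,050 Hz sample rate for the clip in the question is an assumption (the question does not state it).

```python
def frame_rate(sample_rate, hop_length):
    """Feature frames per second (Hz)."""
    return sample_rate / hop_length


rate = frame_rate(22050, 256)
print(round(rate, 2))            # 86.13

# Duration -> approximate number of frames (ignoring padding)
frames_10s = rate * 10
print(round(frames_10s, 2))      # 861.33

# Number of frames -> approximate duration in seconds
duration_22_frames = 22 / rate
print(round(duration_22_frames, 3))  # 0.255
```

So an MFCC matrix with only 22 frames at this hop length points to a very short input, on the order of a quarter of a second.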
Hendrik
  • Thank you very much! So what does it mean when I set n_fft=0.05*sr? Isn't that a 50 ms time frame? – ioan_bl Jul 04 '20 at 21:28
  • No, it's not. `n_fft` is specified in samples not in time. I have added a comprehensive example to my answer, so that it's easier to understand. – Hendrik Jul 05 '20 at 09:33
  • Thank you a lot! One more question. For each one of these 861 frames, does librosa extract one MFCC value? – ioan_bl Jul 06 '20 at 09:02
  • Librosa gives you MFCC values for each frame. In your case 13 per frame, because you asked for `n_mfcc=13`. – Hendrik Jul 06 '20 at 10:08
  • Sorry, but I still can't understand something. Since I get an array with dimensions (13, 22), which is basically 13 arrays with 22 numbers inside each, how do I get 13 values for each frame? Thanks a lot – ioan_bl Jul 15 '20 at 11:02
  • It's a 2-dimensional array. The first index (0-12) specifies which MFCC you are interested in, the second index (0-21) specifies the frame number (in time). Each point in time, i.e. each *frame*, has 13 coefficients. – Hendrik Jul 15 '20 at 15:20
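
The indexing described in the last comment can be sketched with NumPy. The array below is a stand-in filled with dummy values, not real MFCCs; only its shape (13, 22) matches the output from the question.

```python
import numpy as np

n_mfcc, n_frames = 13, 22

# Stand-in for the librosa output: dummy values, shape (n_mfcc, n_frames)
mfccs = np.arange(n_mfcc * n_frames, dtype=float).reshape(n_mfcc, n_frames)

frame_5 = mfccs[:, 5]   # all 13 coefficients of frame 5 (one point in time)
coeff_0 = mfccs[0, :]   # coefficient 0 across all 22 frames
per_frame = mfccs.T     # transposed view: one row per frame

print(frame_5.shape)    # (13,)
print(coeff_0.shape)    # (22,)
print(per_frame.shape)  # (22, 13)
```

Transposing is often convenient when each frame should be treated as one 13-dimensional feature vector, for example as input rows for a classifier.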