import librosa
result = librosa.feature.mfcc(y=signal, sr=16000, n_mfcc=13, n_fft=2048, hop_length=400)
result.shape  # shape is an attribute, not a method

The signal is 1 second long with a sampling rate of 16000 Hz. I compute 13 MFCCs with a hop length of 400. The output dimensions are (13, 41). Why do I get 41 frames? Isn't it supposed to be (time*sr/hop_length)=40?

Hendrik
Rasula

1 Answer


TL;DR answer

Yes, it is correct.

Long answer

You are passing a time series (signal) as input, which means that librosa first computes a mel spectrogram using the melspectrogram function. It takes a number of arguments, two of which you have already specified (n_fft and hop_length). It's important to note that melspectrogram also offers the two parameters center and pad_mode, with the default values True and "reflect" respectively.

From the docs:

pad_mode: string: If center=True, the padding mode to use at the edges of the signal. By default, STFT uses reflection padding.

center: boolean: If True, the signal y is padded so that frame t is centered at y[t * hop_length]. If False, then frame t begins at y[t * hop_length]

In other words, by default, librosa makes your signal longer (pads) in order to support centering.
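To make the padding concrete, here is a minimal pure-Python sketch of reflection padding. The helper reflect_pad is hypothetical and written only for illustration; it is not librosa's implementation, and it assumes the pad width is smaller than the signal length.

```python
def reflect_pad(x, pad):
    """Mirror `pad` samples around each edge, without repeating the edge sample.

    Illustrative helper only; assumes 0 <= pad < len(x).
    """
    left = x[pad:0:-1]            # samples 1..pad, reversed
    right = x[-2:-pad - 2:-1]     # the pad samples before the last one, reversed
    return left + x + right


print(reflect_pad([0, 1, 2, 3, 4], 2))
# [2, 1, 0, 1, 2, 3, 4, 3, 2]

# The signal from the question: 16000 samples, padded by n_fft // 2 on each side.
padded = reflect_pad(list(range(16000)), 2048 // 2)
print(len(padded))
# 18048
```

The padded signal is n_fft // 2 samples longer on each side, which is what lets the first frame be centered on the first sample.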

If you'd like to avoid this behavior, you should pass center=False to your mfcc call.

That said, when setting center to False, keep in mind that with an n_fft length of 2048 and a hop length of 400, you still don't get (time*sr/hop_length)=40 frames, because you also have to account for the window length and not just the hop length (unless you pad somehow). The hop length only specifies by how many samples you move that window.

To give an extreme example, consider a very large window and a very short hop length: Assume 10 samples (e.g. time=1s, sr=10Hz), a window length of n_fft=9 and hop_length=1 with center=False. Now imagine sliding the window over the 10 samples.

   ◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◻︎
   ◻︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎
t  0123456789

◻︎ sample not covered by window
◼︎ sample covered by window

At first the window starts at t=0 and ends at t=8. How many times can we shift it by hop_length and still expect it to not run out of samples? Exactly once, until it starts at t=1 and ends at t=9. Add the first unshifted one and you arrive at 2 frames. This is obviously different from the incorrect (time*sr/hop_length)=1*10/1=10.

Correct would be: (time*sr-n_fft)//hop_length+1=(1*10-9)//1+1=2 with // denoting Python-style integer division.
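The sliding-window argument can be checked by simply enumerating the windows. This is a small sketch with a hypothetical helper, frame_starts, that lists the start index of every full window that fits into the unpadded signal (i.e. the center=False case):

```python
def frame_starts(n_samples, n_fft, hop_length):
    """Start indices of all full windows that fit without padding."""
    return list(range(0, n_samples - n_fft + 1, hop_length))


# Toy example from above: 10 samples, window of 9, hop of 1.
n_fft = 9
for start in frame_starts(10, n_fft, 1):
    print(f"window covers t={start}..{start + n_fft - 1}")
# window covers t=0..8
# window covers t=1..9

# The numbers from the question, without centering.
print(len(frame_starts(16000, 2048, 400)))
# 35
```

So with center=False, the signal from the question would yield 35 frames rather than 40.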

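When using the default, i.e. center=True, the signal is padded with n_fft // 2 samples on both ends, so n_fft falls out of the equation. For your signal, the padded length is 16000+2*(2048//2)=18048, which gives (18048-2048)//400+1=16000//400+1=41 frames.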
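Both cases can be folded into one small function. This is a sketch of the arithmetic described above, not a librosa API; the name num_frames is made up for illustration.

```python
def num_frames(n_samples, n_fft, hop_length, center=True):
    """Number of frames for a signal of n_samples samples.

    With center=True, the signal is assumed to be padded by
    n_fft // 2 samples on both ends before framing.
    """
    if center:
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0  # not even one full window fits
    return (n_samples - n_fft) // hop_length + 1


print(num_frames(16000, 2048, 400, center=True))   # 41
print(num_frames(16000, 2048, 400, center=False))  # 35
print(num_frames(10, 9, 1, center=False))          # 2
```

For an even n_fft the centered case reduces to n_samples // hop_length + 1, which is why the frame count appears to be independent of n_fft.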

Hendrik
  • Are you aware why I get 41? It seems the number of frames is always (time*sr/hop_length)+1, no matter the n_fft passed. – Rasula Jul 02 '21 at 12:06
  • When `center=True`, *librosa* pads with `n_fft // 2` samples on either end. So in essence, `n_fft` has no effect on the number of frames. But with `center=False`, that's not the case and you have to take `n_fft` into account. I'll add this to my answer. – Hendrik Jul 04 '21 at 20:24
  • @Hendrik can you explain how the **window** causes the number of frames to be 41 by showing an example calculation? – Chong Onn Keat Sep 26 '21 at 11:38
  • I didn't write that the *window* length causes the number of frames to be 41. I just wrote that, if you turn `center` off, you have to take the window length into account. – Hendrik Sep 27 '21 at 12:22