
I'm testing the MFCC feature from the tensorflow.signal implementation. According to the example (https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms), it computes all 80 MFCCs and then takes the first 13.

I have tried both the above and the "compute the first 13 directly" approach, and the results are very different:

All 80 first, then take the first 13: [plot of the resulting MFCCs]

Compute the first 13 directly: [plot of the resulting MFCCs]

Why the big difference, and which one should I use if I'm passing this as a feature to a CNN or RNN?

TYZ
  • Thanks for your question, I have learned from it. I just have a query: how did you display the MFCCs? I am trying to achieve that but could not. – Fatimah Mohmmed Mar 14 '22 at 08:57

1 Answer


That's because of the nature of MFCC. Remember that these coefficients are calculated over the frequency range on the mel scale that you provide via lower_edge_hertz and upper_edge_hertz in the linked code.

What it means in practice:

  • "Calculate 13 coefficients directly": take the frequency range [80.0, 7600.0] Hz and divide it into 13 mel bins. You eventually get 13 coefficients that reflect the amplitudes of the corresponding spectrum (see the MFCC algorithm).

  • "All 80 first, then take the first 13": take the frequency range [80.0, 7600.0] Hz and divide it into 80 mel bins, then keep only the first 13 coefficients. In practice, that means you are looking at a much narrower and finer-grained part of the spectrum, in this case roughly the human speech frequency range [80, 400] Hz (back-of-the-envelope calculation). That makes sense if you're doing human speech recognition, as you can focus on more subtle variations while ignoring the higher-frequency spectrum (which is less interesting from our auditory system's perspective).
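A minimal sketch of the two approaches, following the linked docs example. The sample rate, frame sizes, and the random dummy audio are assumptions for illustration; only num_mel_bins changes between the two variants:

```python
import tensorflow as tf

sample_rate = 16000.0
num_spectrogram_bins = 513  # fft_length // 2 + 1 for fft_length = 1024
lower_edge_hertz, upper_edge_hertz = 80.0, 7600.0

# Dummy audio: 1 second of noise standing in for a real waveform.
pcm = tf.random.normal([1, 16000])
stfts = tf.signal.stft(pcm, frame_length=1024, frame_step=256, fft_length=1024)
spectrograms = tf.abs(stfts)

def mfccs(num_mel_bins, num_coeffs):
    # Warp the linear spectrogram onto num_mel_bins mel bins,
    # then take the first num_coeffs MFCCs.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, sample_rate,
        lower_edge_hertz, upper_edge_hertz)
    mel_spectrograms = tf.tensordot(spectrograms, mel_matrix, 1)
    log_mel = tf.math.log(mel_spectrograms + 1e-6)
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :num_coeffs]

mfcc_80_then_13 = mfccs(80, 13)  # "all 80 first, then take the first 13"
mfcc_13_direct = mfccs(13, 13)   # "compute the first 13 directly"
```

Both tensors have the same shape, but their values differ because the DCT in each case is taken over a different mel-bin resolution.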

Lukasz Tracewski
  • This makes complete sense! How could I forget about this. Thank you for pointing it out. I guess in all cases, calculating with the number of bins that I will be using at the end would make more sense, right? I.e., if I only need 13 features, then do the calculation with 13 coefficients. – TYZ Mar 03 '20 at 14:25
    @TYZ No, it depends on the use case. If you're into human speech recognition, then it makes sense to focus on the human frequency range. If you do birds, then you'll likely be interested in the full frequency spectrum, as they have a much wider vocalisation range. In other words, for ASR (automatic speech recognition), it makes more sense to take e.g. 128 and grab the first 40. Or 30. Case dependent :). – Lukasz Tracewski Mar 03 '20 at 15:00
    The number of MFCC coefficients and the number of mel bins are not the same thing. – Pablo Riera Aug 15 '20 at 20:31