
I am using the mel spectrogram function, which can be found here: Mel Spectrogram Librosa

I use it as follows:

signal = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_fft=512, n_mels=128)

Why are 128 mel bands used? I understand that the mel filterbank is meant to simulate the "filterbank" in the human ear, which is why it has coarser resolution at higher frequencies.

I am designing and implementing a speech-to-text system with deep learning. When I used n_mels=64 it didn't work at all; it only works with n_mels=128.

Could it be because I am normalizing it before feeding it into the network? I am using the librosa.util.normalize function, which normalizes the mel spectrogram to between -1 and 1.
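
For illustration, a minimal sketch of this pipeline (the file path and the power_to_db step are assumptions added here for completeness; librosa.util.normalize divides by the maximum absolute value):

import librosa
import numpy as np

# Load the audio at its native sampling rate (the path is a placeholder).
waveform, sample_rate = librosa.load("utterance.wav", sr=None)

# Mel spectrogram with the parameters from the question.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=512, n_mels=128
)

# Convert power to dB (an assumption here): the raw power spectrogram is
# non-negative, so without this step normalization would give 0..1, not -1..1.
mel_db = librosa.power_to_db(mel, ref=np.max)

# librosa.util.normalize divides by the maximum absolute value
# (per column by default), so the result lies in [-1, 1].
features = librosa.util.normalize(mel_db)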

I tried to find the reasoning behind this; the only paper I found was this one, where mel bands from 512 to 128 are used: Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Output examples when n_mels=128: [image]

Output examples when n_mels=64: [image]

Thanks.

swe87

  • There are plenty of cases where 64 mel bands are used, or even 32, and all kinds of values in between. So if it didn't work at all (what do you mean by that?), likely there were other issues at play – Jon Nordby Jun 28 '20 at 19:08
  • I tried the same neural network architecture with n_mels=128 first and it generated good output; most of the speech was converted to text. Then I tried with n_mels=64 (with the rest of the parameters the same as before) and the output was gibberish, things like "yyyyy", that's it. Also, why 32 or 64? I know that mel filterbanks simulate the human ear's "filterbank", but I'm not sure why those particular numbers. Thanks. – swe87 Jun 28 '20 at 20:38
  • Can you quantify that performance difference? Is the model stable with respect to hyperparameters, seeds, etc.? – Jon Nordby Jun 28 '20 at 20:58
  • The numbers chosen are all arbitrary. Any number between 1 and the number of bins in your FFT can be used. Increasing the number of bands increases frequency resolution, which can increase performance - but also computational requirements and dataset size requirements – Jon Nordby Jun 28 '20 at 21:00
  • Speech occupies a rather small part of the frequency spectrum, so you may want to set fmin=20 and fmax=4000 or so. This might make a difference when n_mels is low. – Jon Nordby Jun 28 '20 at 21:01 (see the sketch after these comments)
  • Yes, the model is stable with respect to hyperparameters and seeds; the same model has been executed with n_mels=128, differing only in the number of epochs, and it has always performed well. With n_mels=64 --> WER = 0.9960; with n_mels=128 --> WER = 0.7632. I will update the post with output examples. – swe87 Jun 28 '20 at 21:05
  • OK, yes, I see there is an option in librosa to set fmin and fmax; maybe I should try that. Thanks – swe87 Jun 28 '20 at 21:08
  • Do you think it is because I am normalizing it before feeding it into the network? I am using the librosa.util.normalize function and it normalizes the mel spectrogram between -1 and 1. – swe87 Jun 28 '20 at 22:34
  • The numbers are not chosen randomly; they are powers of 2 because that makes life easier for your GPU. Using a number that is not a power of 2 may lead to decreased performance (speed, not accuracy). – JanLauGe Jan 02 '21 at 21:21
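
Following up on the fmin/fmax suggestion above, a minimal sketch of restricting the mel filterbank to the speech band (the cutoff values are just the ones suggested in the comments, and the file path is a placeholder):

import librosa

# Load the audio as before (the path is a placeholder).
waveform, sample_rate = librosa.load("utterance.wav", sr=None)

# Limit the mel filterbank to (roughly) the speech band, as suggested above.
mel_speech = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=512, n_mels=64, fmin=20, fmax=4000
)

# The filterbank itself shows the n_mels trade-off: each of the n_mels rows is
# one triangular filter spread across the 1 + n_fft // 2 = 257 FFT bins.
fb = librosa.filters.mel(sr=sample_rate, n_fft=512, n_mels=64, fmin=20, fmax=4000)
print(fb.shape)  # (64, 257)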

0 Answers