
I am using the mel spectrogram function, which can be found here: Mel Spectrogram Librosa

I use it as follows:

signal = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_fft=512, n_mels=128)

Why are 128 mel bands used? I understand that the mel filterbank is meant to simulate the "filterbank" in the human ear, which is why it has coarser resolution at higher frequencies.

I am designing and implementing a speech-to-text system with deep learning. When I used n_mels=64 it didn't work at all; it only works with n_mels=128.

Could it be because I am normalizing it before feeding it into the network? I am using the librosa.util.normalize function, which normalizes the mel spectrogram to between -1 and 1.
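
For illustration, a minimal sketch of this pipeline (the file path and the power_to_db step are assumptions added here for completeness; librosa.util.normalize divides by the maximum absolute value):

import librosa
import numpy as np

# Load the audio at its native sampling rate (the path is a placeholder).
waveform, sample_rate = librosa.load("utterance.wav", sr=None)

# Mel spectrogram with the parameters from the question.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=512, n_mels=128
)

# Convert power to dB (an assumption here): the raw power spectrogram is
# non-negative, so without this step normalization would give 0..1, not -1..1.
mel_db = librosa.power_to_db(mel, ref=np.max)

# librosa.util.normalize divides by the maximum absolute value
# (per column by default), so the result lies in [-1, 1].
features = librosa.util.normalize(mel_db)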

I tried to find the reasoning behind this; the only paper I found was this one, where mel bands from 512 to 128 are used: Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Output examples when n_mels=128: [image]

Output examples when n_mels=64: [image]

Thanks.

swe87

  • There are plenty of cases where 64 mel bands are used, or even 32, and all kinds of values in between. So if it didn't work at all (what do you mean by that?), likely there were other issues at play – Jon Nordby Jun 28 '20 at 19:08
  • I tried the same neural network architecture with n_mels=128 first and it generated good output; most of the speech was converted to text. Then I tried with n_mels=64 (with the rest of the parameters the same as before) and the output was gibberish, things like "yyyyy", that's it. Also, why 32 or 64? I know that mel filterbanks simulate the human ear's "filterbank", but I'm not sure why those particular numbers. Thanks. – swe87 Jun 28 '20 at 20:38
  • Can you quantify that performance difference? Is the model stable with respect to hyperparameters, seeds, etc.? – Jon Nordby Jun 28 '20 at 20:58
  • The numbers chosen are all arbitrary. Any number between 1 and the number of bins in your FFT can be used. Increasing the number of bands increases frequency resolution, which can increase performance - but also computational requirements and dataset size requirements – Jon Nordby Jun 28 '20 at 21:00
  • Speech occupies a rather small part of the frequency spectrum, so you may want to set fmin=20 and fmax=4000 or so. This might make a difference when n_mels is low. – Jon Nordby Jun 28 '20 at 21:01 (see the sketch after these comments)
  • Yes, the model is stable with respect to hyperparameters and seeds; the same model has been executed with n_mels=128, differing only in the number of epochs, and it has always performed well. With n_mels=64 --> WER = 0.9960; with n_mels=128 --> WER = 0.7632. I will update the post with output examples. – swe87 Jun 28 '20 at 21:05
  • OK, yes, I see there is an option in librosa to set fmin and fmax; maybe I should try that. Thanks – swe87 Jun 28 '20 at 21:08
  • Do you think it is because I am normalizing it before feeding it into the network? I am using the librosa.util.normalize function and it normalizes the mel spectrogram between -1 and 1. – swe87 Jun 28 '20 at 22:34
  • The numbers are not chosen randomly; they are powers of 2 because that makes life easier for your GPU. Using a number that is not a power of 2 may lead to decreased performance (speed, not accuracy). – JanLauGe Jan 02 '21 at 21:21
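
Following up on the fmin/fmax suggestion above, a minimal sketch of restricting the mel filterbank to the speech band (the cutoff values are just the ones suggested in the comments, and the file path is a placeholder):

import librosa

# Load the audio as before (the path is a placeholder).
waveform, sample_rate = librosa.load("utterance.wav", sr=None)

# Limit the mel filterbank to (roughly) the speech band, as suggested above.
mel_speech = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=512, n_mels=64, fmin=20, fmax=4000
)

# The filterbank itself shows the n_mels trade-off: each of the n_mels rows is
# one triangular filter spread across the 1 + n_fft // 2 = 257 FFT bins.
fb = librosa.filters.mel(sr=sample_rate, n_fft=512, n_mels=64, fmin=20, fmax=4000)
print(fb.shape)  # (64, 257)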

0 Answers