I am trying to get the spectrogram as described by the following instruction.

Each audio segment has duration of 5s. Frames with equal size are extracted from the audio (with overlap between the consecutive frames), and each of the frame consists of 1024 samples. The mel-scale is divided into 128 bins. Therefore, the spectrogram for the audio segment has the dimension of 192×128.

To my knowledge, this instruction implies n_mels=128 and n_fft=1024 in the melspectrogram function. So I tried to get a spectrogram with the following code:

from librosa import load, power_to_db
from librosa.display import specshow
from librosa.feature import melspectrogram

audio_path = r'5s.wav'
y, sr = load(audio_path, sr=44100)
# Recent librosa releases require y and sr to be passed as keyword arguments
S = melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=512)
print(S.shape)

The shape of y is (220500,) and the sampling rate sr is 44100. The spectrogram shape I get is (128, 431). The 128 mel bins are correct, but I get 431 frames instead of the 192 frames mentioned in the instruction.

In order to get 192 frames, I changed the sampling rate to 22050 and kept adjusting hop_length until the spectrogram had 192 frames:

audio_path = r'5s.wav'
y, sr = load(audio_path, sr=22050)
# Recent librosa releases require y and sr to be passed as keyword arguments
S = melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=575)
print(S.shape)

However, I am not sure this is the correct way to get the spectrogram dimension that I want; the process is just trial and error. Is there a more principled way to get a spectrogram with a given shape without guessing the parameter values?

Raven Cheuk

1 Answer

If you divide the length of y by hop_length, you get (approximately) the number of frames:

220500 / 512 = 430.6

If you need 192 frames, input 193 * 512 = 98816 samples in y.
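The division above can be made exact. As a minimal sketch, assuming librosa's default `center=True` padding (under which the frame count is `1 + len(y) // hop_length`), the helpers below compute the number of frames and list every hop length that yields a target count. The names `n_frames` and `hops_for` are hypothetical, not part of librosa.

```python
def n_frames(n_samples, hop_length, n_fft=1024, center=True):
    """Number of STFT frames for a signal of n_samples samples.

    Assumes librosa-style framing: with center=True the signal is padded
    by n_fft // 2 on each side, so the count does not depend on n_fft.
    """
    if center:
        return 1 + n_samples // hop_length
    return 1 + (n_samples - n_fft) // hop_length


def hops_for(n_samples, target_frames, n_fft=1024, center=True):
    """All hop lengths below n_fft (so frames overlap) giving target_frames."""
    return [
        hop
        for hop in range(1, n_fft)
        if n_frames(n_samples, hop, n_fft, center) == target_frames
    ]


print(n_frames(220500, 512))   # 5 s at 44100 Hz, hop 512
print(hops_for(110250, 192))   # 5 s at 22050 Hz
print(hops_for(220500, 192))   # 5 s at 44100 Hz
```

Under these assumptions, 5 s at 22050 Hz gives 192 frames for hop lengths 575 to 577, while 5 s at 44100 Hz has no hop length below 1024 that gives 192 frames, so overlapping 1024-sample frames cannot produce that shape at the higher rate.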

lenik
  • I thought `hop_length` means the number of samples skipped to reach the starting point of the next window. And your calculation doesn't seem to work for the second code snippet: this time I have `110250` samples and my hop_length is `564`. By your calculation, `110250/564 = 195.5`, so the number of frames is not 192. – Raven Cheuk Jul 09 '18 at 12:44
  • @RavenCheuk if you have N samples and slide a window over those samples, skipping K samples every time, how many times will you be able to slide the window? – lenik Jul 10 '18 at 11:53