I am trying to get the spectrogram as described by the following instruction.
Each audio segment has duration of 5s. Frames with equal size are extracted from the audio (with overlap between the consecutive frames), and each of the frame consists of 1024 samples. The mel-scale is divided into 128 bins. Therefore, the spectrogram for the audio segment has the dimension of 192×128.
To my knowledge, this instruction implies n_mels=128 and n_fft=1024 in the melspectrogram function. So I tried to get a spectrogram with the following code:
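from librosa import load, power_to_db
from librosa.display import specshow
from librosa.feature import melspectrogram

audio_path = r'5s.wav'
# Load the 5 s clip, resampled to 44.1 kHz
y, sr = load(audio_path, sr=44100)
# Keyword arguments: recent librosa versions no longer accept y and sr positionally
S = melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=512)
print(S.shape)  # (n_mels, n_frames)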
The shape of y is (220500,) and the sampling rate sr is 44100. The spectrogram shape I get is (128, 431). The 128 mel bins are correct, yet the number of frames I get is 431 instead of the 192 frames mentioned in the instruction.
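For reference, the 431 can be reproduced by hand. Assuming librosa's default center=True (the signal is padded so that frames are centered on their timestamps), the frame count should be 1 + len(y) // hop_length. A minimal sketch of that arithmetic; the helper name n_frames is made up for illustration and is not part of librosa:

```python
def n_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered (padded) STFT: 1 + n_samples // hop_length.

    Assumes center=True padding; n_fft does not affect the count in that case.
    """
    return 1 + n_samples // hop_length


# 5 s at 44100 Hz = 220500 samples, hop_length = 512
print(n_frames(220500, 512))  # 431
```

Under this assumption the number of frames depends only on the signal length and hop_length, not on n_fft.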
In order to get 192 frames, I changed the sampling rate to 22050 and kept adjusting hop_length until the spectrogram had 192 frames:
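audio_path = r'5s.wav'
# Load the same clip at 22.05 kHz (110250 samples)
y, sr = load(audio_path, sr=22050)
# hop_length=575 was found by trial and error
S = melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=575)
print(S.shape)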
However, I am not sure if it is the correct way to get the spectrogram dimension that I want. It seems the process is just trial and error. I wonder if there is a more scientific way to get a spectrogram with the shape that I want without guessing the parameter values?
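One way to avoid guessing is to invert the frame-count relation. This is only a sketch under the assumption that librosa's default center=True padding applies, so that n_frames = 1 + len(y) // hop_length; the function name hop_lengths_for is hypothetical and not a librosa API:

```python
def hop_lengths_for(n_samples: int, target_frames: int) -> range:
    """Return every hop length h with 1 + n_samples // h == target_frames.

    Assumes a centered (padded) STFT. The range is empty if no hop length fits.
    """
    if target_frames < 2:
        raise ValueError("target_frames must be at least 2")
    k = target_frames - 1           # required value of n_samples // h
    lo = n_samples // (k + 1) + 1   # smallest h with n_samples // h <= k
    hi = n_samples // k             # largest h with n_samples // h >= k
    return range(lo, hi + 1)


# 5 s at 22050 Hz = 110250 samples
print(list(hop_lengths_for(110250, 192)))  # [575, 576, 577]
# 5 s at 44100 Hz = 220500 samples
print(list(hop_lengths_for(220500, 192)))  # [1149, 1150, 1151, 1152, 1153, 1154]
```

If this relation holds, the 44100 Hz candidates are all larger than the 1024-sample frame, so consecutive frames would not overlap. That would be at odds with the overlap mentioned in the instruction, which suggests the 22050 Hz setting is the more plausible reading.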