I have several audios with different duration. So I don't know how to ensure the same number N of segments of the audio. I'm trying to implement an existing paper, so it's said that first a Log Mel-Spectrogram is performed in the whole audio with 64 Mel-filter banks from 20 to 8000 Hz, by using a 25 ms Hamming window and a 10 ms overlapping. Then, in order to get that I have the following code lines:
y, sr = librosa.load(audio_file, sr=None)
#sr = 22050
#len(y) = 237142
#duration = 5.377369614512472
n_mels = 64
n_fft = int(np.ceil(0.025*sr)) ## I'm not sure how to complete this parameter
win_length = int(np.ceil(0.025*sr)) # 0.025*22050
hop_length = int(np.ceil(0.010*sr)) #0.010 * 22050
window = 'hamming'
fmin = 20
fmax = 8000
S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
M = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels,fmin=fmin, fmax=fmax)#, kwargs=M)
+ 1e-6)
# M.shape = (64, 532)
(Also I'm not sure how to complete that n_fft parameter.) Then, it's said:
Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped with 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.
So, I'm stuck in this last part, I don't know how to perform the segmentation of M by 64x64. And how can I got the same numbers of segments for all the audios (with different duration), because at the final I will need 64x64xN features as input to my neural network or classifier? I will appreciate a lot any help! I'm a beginner with audio signal processing.