How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?

Question

I have several audio files of different durations, so I don't know how to ensure the same number N of segments for each of them. I'm trying to implement an existing paper. It says that a log Mel-spectrogram is first computed over the whole audio, with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms overlap. To get that, I have the following code:

y, sr = librosa.load(audio_file, sr=None)
#sr = 22050
#len(y) = 237142
#duration = 5.377369614512472

n_mels = 64
n_fft = int(np.ceil(0.025*sr)) ## I'm not sure how to complete this parameter
win_length = int(np.ceil(0.025*sr)) # 0.025*22050
hop_length = int(np.ceil(0.010*sr)) #0.010 * 22050
window = 'hamming'

fmin = 20
fmax = 8000

# pass a power spectrogram (|STFT|^2) as S, not the complex STFT
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False))**2
M = np.log(librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)

(As noted in the code, I'm not sure how to set the n_fft parameter.) The paper then says:

Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped with 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.

I'm stuck on this last part: I don't know how to segment M into 64x64 pieces. Also, how can I get the same number of segments for all the audio files (which have different durations)? In the end I will need 64x64xN features as input to my neural network or classifier. I would appreciate any help a lot! I'm a beginner with audio signal processing.

It's "Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition" — user2687945, Jan 23 '19 at 18:07

Jon Nordby · Accepted Answer · 2019-01-23T12:23:45.500

Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames at each step. At the start and end you need to either truncate or pad the data so that every window is full.

import math

import librosa
import numpy as np

# librosa >= 0.8; older versions used librosa.util.example_audio_file()
audio_file = librosa.example('vibeace')
y, sr = librosa.load(audio_file, sr=None, duration=5.0) # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025*sr))
win_length = int(np.ceil(0.025*sr))
hop_length = int(np.ceil(0.010*sr))
window = 'hamming'

fmin = 20
fmax = 8000

# power spectrogram: pass |STFT|^2 as S, not the complex STFT
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False))**2
frames = np.log(librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

window_size = 64
window_hop = 30

# truncate at start and end so that every window contains full data
# alternative would be to zero-pad
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    # each segment covers the 64 frames ending just before frame_idx
    # (named segment to avoid overwriting the STFT window name above)
    segment = frames[:, frame_idx-window_size:frame_idx]
    assert segment.shape == (n_mels, window_size)
    print('classify window', frame_idx, segment.shape)

This will output:

classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
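
The number of windows printed above follows directly from the number of spectrogram frames. As a rough sketch of that arithmetic (the helper names `n_stft_frames` and `n_windows` are made up for illustration, and the 44.1 kHz sample rate is an assumption, since the sample rate of the example file is not stated here):

```python
import math

def n_stft_frames(n_samples, win_length, hop_length):
    """Number of STFT frames when center=False (no padding at the edges)."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length

def n_windows(n_frames, window_size=64, window_hop=30):
    """Number of windows visited by the range() loop in the code above."""
    end_frame = window_hop * math.floor(n_frames / window_hop)
    return len(range(window_size, end_frame, window_hop))

sr = 44100                          # assumed sample rate
win_length = math.ceil(0.025 * sr)  # 25 ms window
hop_length = math.ceil(0.010 * sr)  # 10 ms hop

for seconds in (3.0, 5.0, 8.0):
    frames = n_stft_frames(int(seconds * sr), win_length, hop_length)
    print(seconds, 's ->', frames, 'frames ->', n_windows(frames), 'windows')
```

Under these assumptions, 5 seconds gives 498 frames and 14 windows, which is consistent with the windows ending at 64, 94, ..., 454.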

However, the number of windows depends on the length of the audio sample. If every file must produce the same number of windows, you need to make sure all audio samples have the same length.
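
One possible way to do that is to truncate or pad every log Mel-spectrogram to a fixed number of frames before segmenting it. The following is only a sketch under assumptions: `fix_num_frames` and `split_segments` are hypothetical helpers, the random arrays stand in for real spectrograms, padding with the minimum value is one choice among several, and the segmentation starts at frame 0 rather than using the exact truncation of the loop above.

```python
import numpy as np

def fix_num_frames(log_mel, target_frames, pad_value=None):
    """Truncate or right-pad a (n_mels, n_frames) array to target_frames columns."""
    n_frames = log_mel.shape[1]
    if n_frames >= target_frames:
        return log_mel[:, :target_frames]
    if pad_value is None:
        pad_value = log_mel.min()  # assumption: pad with the quietest value seen
    pad = target_frames - n_frames
    return np.pad(log_mel, ((0, 0), (0, pad)),
                  mode='constant', constant_values=pad_value)

def split_segments(log_mel, window_size=64, window_hop=30):
    """Return an array of shape (n_segments, n_mels, window_size)."""
    n_frames = log_mel.shape[1]
    starts = range(0, n_frames - window_size + 1, window_hop)
    return np.stack([log_mel[:, s:s + window_size] for s in starts])

# Stand-ins for two log Mel-spectrograms of different duration
rng = np.random.default_rng(0)
short = rng.normal(size=(64, 300))
long = rng.normal(size=(64, 700))

# Hypothetical choice: room for exactly 15 segments (64 + 14 * 30 = 484 frames)
target_frames = 64 + 30 * 14

for log_mel in (short, long):
    segments = split_segments(fix_num_frames(log_mel, target_frames))
    print(segments.shape)  # (15, 64, 64) for both inputs
```

The same effect can be had earlier in the pipeline by fixing the length of the waveform itself (for instance with `librosa.util.fix_length(y, size=...)`) before computing the spectrogram. Either way, the target length is a design choice: a short target throws away audio, a long one adds padding to most files.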

edited Jan 23 '19 at 12:23

answered Jan 23 '19 at 11:57

Jon Nordby


Thank you so much for the clarification! However, how could I classify different segment sets, with different feature lengths? I am using now all the segments to train a CNN model. Then, how could I fuse/join all of them for classification? – user2687945 Jan 23 '19 at 18:16
Do you have labels per segment or for the whole files? – Jon Nordby Jan 24 '19 at 09:15
I set labels for each segment during the training stage. – user2687945 Jan 24 '19 at 14:16
Do the different segments in the file have different labels, or do they all 'inherit' the label of the file? If the latter then you can use Multi Instance Learning to train all segments against the label of the file. – Jon Nordby Jan 25 '19 at 10:23
Actually, all inherit the label of the file! Thank you for your suggestion! I will try to follow that approach! – user2687945 Jan 25 '19 at 15:09
I wrote about Multiple Instance Learning for audio here, https://stackoverflow.com/questions/55272508/keras-how-to-write-customized-loss-function-to-aggregate-over-frame-level-predi – Jon Nordby Apr 14 '19 at 22:05
And about voting across windows here: https://stackoverflow.com/questions/53862626/keras-how-to-aggregate-over-frame-level-predictions-to-song-level-prediction – Jon Nordby Apr 14 '19 at 22:07
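
As a rough illustration of voting across windows when all segments inherit the file label: the sketch below is not taken from the linked posts. It assumes a segment-level model that outputs class probabilities, and the helper names and the example numbers are invented. It shows two simple ways to turn a variable number of segment predictions into one file-level prediction.

```python
import numpy as np

def mean_vote(segment_probs):
    """Average per-segment class probabilities, then pick the best class.

    segment_probs has shape (n_segments, n_classes); n_segments may
    differ from file to file.
    """
    mean_probs = np.mean(segment_probs, axis=0)
    return int(np.argmax(mean_probs)), mean_probs

def majority_vote(segment_probs):
    """Each segment votes for its most likely class; the most frequent wins."""
    votes = np.argmax(segment_probs, axis=1)
    counts = np.bincount(votes, minlength=segment_probs.shape[1])
    return int(np.argmax(counts))

# Invented probabilities for one file with 5 segments and 3 classes
probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
])

label, mean_probs = mean_vote(probs)
print('mean vote:', label, mean_probs)
print('majority vote:', majority_vote(probs))
```

Because the aggregation runs over the segment axis, files with different numbers of segments still yield a single prediction each, so the segment count does not have to be equal across files for this step.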
