Identifying the loudest part of an audio track and cropping (Librosa or torchaudio)

Question

I've built a U-Net model to perform audio mixing of multitrack audio, for which I've used 20s clips of the audio tracks (converted into spectrograms) as input in training the model. However, the training process is incredibly long, so I think it would be better to take 2s clips from each track to train the model.

The data is organised as 8 stems (individual instrument tracks) as the inputs and a single mixture of the stems as the target (all have sr=44100). I want to find the most energetic 2s section of the mixture track and crop all tracks (inputs and mixture) to this specific 2s part. I'm mainly using librosa in my data preparation, but I'm unsure what functions to use to find the start point of the loudest (I understand this is ambiguous) 88200-sample segment (2s).

skumhest · Accepted Answer · 2022-08-06T00:01:52.273

If I am following the question correctly, the code below might be useful as a starting point. It takes in one sound file, locates where it is "loudest" (as you allude to in the question, defining which part is loudest is not entirely straightforward) using librosa.feature.rms, and then cuts a two-second slice out of the original sound file centered on that point:

import librosa

FILENAME = 'soundfile.wav'  # change to path of your sound file
FRAME_LENGTH = 2048
HOP_LENGTH = 512
NUM_SECONDS_OF_SLICE = 2

# sr=None keeps the file's native sample rate; audio is loaded as mono by default
sound, sr = librosa.load(FILENAME, sr=None)

# Frame-wise RMS energy, shape (1, n_frames)
clip_rms = librosa.feature.rms(y=sound,
                               frame_length=FRAME_LENGTH,
                               hop_length=HOP_LENGTH)

clip_rms = clip_rms.squeeze()
peak_rms_index = clip_rms.argmax()
# With the default center=True, frame t is centered on sample t * HOP_LENGTH
peak_index = peak_rms_index * HOP_LENGTH

# Center the slice on the peak, shifting it inward if it would cross either end
slice_width = int(NUM_SECONDS_OF_SLICE * sr)
left_index = max(0, min(peak_index - slice_width // 2, len(sound) - slice_width))
right_index = left_index + slice_width
sound_slice = sound[left_index:right_index]

Thanks, this works perfectly, I just need to implement it to loop over a dataset now. Just a quick question: some of my data is loaded in stereo (i.e. shape of `(2, no. of samples)`), do you know how I would slice the sound whilst retaining the dimensionality? Currently I have `sound_slice = sound[1][left_index:right_index]`, which returns an array of shape `(88200,)` rather than `(2, 88200)`. — Brudalaxe, Aug 07 '22 at 11:33
Sorry, ignore this, my brain isn't working well today - it's obviously just `sound_slice = sound[:,left_index:right_index]` — Brudalaxe, Aug 07 '22 at 11:42
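
For the multitrack case in the question (and the stereo case raised in the comments), one possible sketch is below. It assumes every stem and the mixture are NumPy arrays of equal length along the last axis, shaped either `(n_samples,)` or `(channels, n_samples)`. The names `crop_window` and `crop_tracks` are hypothetical helpers, not librosa functions, and `peak_index` would come from the RMS step above, computed on the mixture.

```python
import numpy as np

def crop_window(peak_index, n_samples, width):
    """Return (left, right) for a window of `width` samples centered on
    peak_index, shifted inward so it stays inside [0, n_samples)."""
    left = peak_index - width // 2
    left = max(0, min(left, n_samples - width))
    return left, left + width

def crop_tracks(tracks, peak_index, sr, num_seconds=2):
    """Crop every track to the same window. Indexing with `...` slices the
    last (time) axis, so mono and stereo arrays keep their dimensionality."""
    width = int(num_seconds * sr)
    n_samples = tracks[0].shape[-1]
    left, right = crop_window(peak_index, n_samples, width)
    return [t[..., left:right] for t in tracks]

# Synthetic stand-ins for loaded audio: 8 stereo stems plus a mono mixture
sr = 44100
stems = [np.zeros((2, 20 * sr), dtype=np.float32) for _ in range(8)]
mixture = np.zeros(20 * sr, dtype=np.float32)

# peak_index is a made-up value here; in practice it comes from the mixture
cropped = crop_tracks(stems + [mixture], peak_index=100, sr=sr)
print(cropped[0].shape, cropped[-1].shape)
```

Because the window is computed once and reused, the stems and the mixture stay sample-aligned, which is what the training pairs need.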

Theo Lamort · Answer 2 · 2023-02-22T09:23:05.070

I found a nice trick that does this:

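import numpy as np

def crop_loudest(audio, target_length):
    # audio: 1-D (mono) NumPy array
    # Prefix sums of squared samples with a leading 0, so cs[i] = sum(audio[:i] ** 2)
    cs = np.concatenate(([0.0], np.cumsum(audio.astype(np.float64) ** 2)))
    # Entry x is the energy of the window audio[x:x + target_length]
    start = (cs[target_length:] - cs[:-target_length]).argmax()
    return audio[start:start+target_length]

I found it's quite fast. Hope it helps someone!

Edit: explanation
Finding the window of length target_length with maximal RMS is the same as finding the window with the largest sum of squares. If we compute the cumulative sum of squares with a leading zero, so that cs[i] is the sum of squares of the first i samples, then cs[x + target_length] - cs[x] is the sum of squares over the window [x: x+target_length]. The leading zero matters: with a plain np.cumsum(audio ** 2), the same difference covers [x+1: x+1+target_length], so the start index comes out one sample early and the window starting at 0 is never considered. The array cs[target_length:] - cs[:-target_length] contains exactly this result for x ranging from 0 to len(audio) - target_length. We take the argmax and voilà!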
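
A small sanity check can make the indexing concrete. The sketch below compares a prefix-sum search against a brute-force loop on a short synthetic signal. `loudest_start` and `loudest_start_bruteforce` are hypothetical names used only here. It pads the cumulative sum with a leading zero so that difference number x covers exactly samples x to x + target_length - 1.

```python
import numpy as np

def loudest_start(audio, target_length):
    # cs[i] holds the sum of squares of audio[:i]; cs[0] is 0
    cs = np.concatenate(([0.0], np.cumsum(audio.astype(np.float64) ** 2)))
    energies = cs[target_length:] - cs[:-target_length]
    return int(energies.argmax())

def loudest_start_bruteforce(audio, target_length):
    # Recompute every window's energy directly: O(n * target_length)
    n_windows = len(audio) - target_length + 1
    energies = [np.sum(audio[x:x + target_length].astype(np.float64) ** 2)
                for x in range(n_windows)]
    return int(np.argmax(energies))

# Silence with a single 50-sample burst starting at sample 300
signal = np.zeros(1000)
signal[300:350] = 1.0

fast = loudest_start(signal, 50)
slow = loudest_start_bruteforce(signal, 50)
print(fast, slow)  # prints: 300 300
```

The prefix-sum version does a constant amount of work per window, which is presumably why it feels fast on long tracks, while the brute-force loop is only practical for small checks like this one.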
