I've built a U-Net model to perform audio mixing of multitrack audio. So far I've trained it on 20s clips of the audio tracks, converted into spectrograms. However, training takes a very long time, so I'd like to train on 2s clips from each track instead.
The data is organised as 8 stems (individual instrument tracks) as the inputs and a single mixture of the stems as the target (all have sr=44100). I want to find the most energetic 2s section of the mixture track and crop all tracks (inputs and mixture) to that same 2s section. I'm mainly using librosa in my data preparation, but I'm unsure which functions to use to find the start point of the loudest 88200-sample (2s) segment. I understand "loudest" is ambiguous.
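To make the goal concrete, here is a minimal sketch of one possible approach. It takes "loudest" to mean the window with the highest sum of squared samples, which is only one of several reasonable definitions. It uses plain NumPy and assumes every track is a mono 1-D float array of the same length (for example as returned by `librosa.load(path, sr=44100)`). The function names `loudest_window_start` and `crop_all` are hypothetical and not part of librosa.

```python
import numpy as np


def loudest_window_start(y, win_len):
    """Return the start index of the win_len-sample window with the highest energy.

    Energy here means the sum of squared samples (an assumed definition of "loudest").
    """
    y = np.asarray(y, dtype=np.float64)
    if y.ndim != 1:
        raise ValueError("expected a mono 1-D signal")
    if len(y) <= win_len:
        # The track is no longer than the window, so the only option is to start at 0.
        return 0
    # Cumulative sum of squared samples with a leading zero, so that
    # csum[i + win_len] - csum[i] is the energy of y[i:i + win_len].
    csum = np.concatenate(([0.0], np.cumsum(y ** 2)))
    energy = csum[win_len:] - csum[:-win_len]
    # argmax returns the first window if several share the maximum energy.
    return int(np.argmax(energy))


def crop_all(stems, mixture, sr=44100, seconds=2.0):
    """Crop every stem and the mixture to the loudest window of the mixture."""
    win_len = int(round(sr * seconds))  # 88200 samples for 2s at 44100 Hz
    start = loudest_window_start(mixture, win_len)
    end = start + win_len
    cropped_stems = [np.asarray(s)[start:end] for s in stems]
    cropped_mixture = np.asarray(mixture)[start:end]
    return cropped_stems, cropped_mixture, start
```

The cumulative sum evaluates every possible start sample in a single pass, so the result is sample-accurate. If frame-level resolution is enough, `librosa.feature.rms` followed by a moving sum over frames would be an alternative, at the cost of snapping the start point to a hop boundary.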