I'm creating a audio classification model for animal sounds. It's a hobby project, just to get myself familiarized with the techniques. The thing that I'm struggling with is the duration differences of my audio clips and how I should cut them into similar duration lengths. It is not so much on the how (because I found many examples on how to split the audio files) but my question is about the duration itself.
My files have some silences but mainly also a lot of repetitive sounds as the dataset is mainly insects. And the insect, like a cricket will make a similar sound, repetitive sound, for a long time. So my idea was: if there is a way to detect repetitions in audio files, use that to split the audio file. And then see what the duration is of the longest clip, and use that as a duration to cut split all the audio files.
But maybe I'm thinking about it all wrong. Does anybody have any suggestions or nice literature for me?