I would like to prepare an audio dataset for a machine learning model.
Each .wav file should be represented as an MFCC image.
While all of the images will have the same number of MFCCs (20), the lengths of the .wav files vary between 3 and 5 seconds.
Should I manipulate all the .wav files to have the same length (e.g. by padding or truncating them)? Should I normalize the MFCC values (e.g. to the range [0, 1]) prior to plotting?
Are there any important steps to do with such data before passing it to a machine learning model?
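For context, here is a minimal sketch of the pipeline I have in mind. The helper names, the sample rate, and the per-file min-max scaling are my own assumptions, and the MFCC matrix below is a random stand-in (with real audio I would load each file and compute the MFCCs with a library such as librosa):

```python
import numpy as np

SR = 22050          # assumed sample rate
TARGET_SECONDS = 5  # pad/truncate every clip to the longest expected length

def fix_length(wav: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad (or truncate) a 1-D waveform to exactly target_len samples."""
    if len(wav) >= target_len:
        return wav[:target_len]
    return np.pad(wav, (0, target_len - len(wav)))

def minmax_normalize(mfcc: np.ndarray) -> np.ndarray:
    """Scale an MFCC matrix to the [0, 1] range (per file)."""
    lo, hi = mfcc.min(), mfcc.max()
    return (mfcc - lo) / (hi - lo + 1e-8)

# With real audio one would load the .wav and compute the MFCCs, e.g.
#   wav, sr = librosa.load(path)
#   mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20)
# (librosa usage is my assumption, shown for illustration only).
clip = np.random.randn(3 * SR)                # a 3-second clip
clip = fix_length(clip, TARGET_SECONDS * SR)  # now exactly 5 seconds long
fake_mfcc = np.random.randn(20, 216)          # stand-in for a real MFCC matrix
scaled = minmax_normalize(fake_mfcc)
print(clip.shape, float(scaled.min()), float(scaled.max()))
```

This way every clip yields a 20-row MFCC matrix of identical width, and all values fall in [0, 1], but I am unsure whether this is the right approach.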
Further reading links would also be appreciated.