I'm working on a sound classification project: given a set of audio recordings, I try to determine which class each recording falls into. You might compare this to music genre recognition or topic recognition (for a body of text). My samples are of varying lengths, and I need to assign each sample exactly one label.
I represent my features as 2D matrices, where each column represents a frame in the audio file (e.g. 0.1 seconds) and each row is a feature pertaining solely to that time frame (e.g. MFCC coefficients). Although the row count is fixed, the number of columns varies with the length of the recording.
I feed in my training and testing data as numpy arrays that contain one 2D n x y matrix per sample, where n is a constant (e.g. 13) and y varies with the length of the current sample.
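To make the layout concrete, here is a minimal sketch with random numbers standing in for my real MFCC matrices. The frame counts and variable names are made up for illustration. As far as I can tell, the float conversion at the end is the same kind of conversion the estimators attempt on their input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 13               # fixed row count (e.g. 13 MFCC coefficients)
frame_counts = [50, 72, 31]   # hypothetical recording lengths in frames

# One (n_features x y) matrix per recording; y differs between recordings.
samples = [rng.normal(size=(n_features, y)) for y in frame_counts]

# A 1D object array whose elements are the individual matrices.
X = np.empty(len(samples), dtype=object)
for i, matrix in enumerate(samples):
    X[i] = matrix

# Converting this ragged container to a regular float array cannot work,
# because the elements do not share a common shape.
try:
    np.asarray(X, dtype=np.float64)
    conversion_error = None
except ValueError as exc:
    conversion_error = exc

print(X.shape, [m.shape for m in X])
print(type(conversion_error).__name__)
```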
Unfortunately, scikit-learn doesn't seem to be a big fan of this layout, time and time again raising a ValueError: setting an array element with a sequence.
Now I've seen a number of solutions:
- Using one of the gadgets in sklearn.feature_extraction to vectorize features in a sequence (text, images of varying size etc.), though most examples I've seen are for text-based problems, so I'm not entirely sure how applicable they are to an audio problem like this.
- Taking the mean across the columns (i.e. averaging each row over time) to produce a single time-independent feature vector (as can be seen here: https://www.youtube.com/watch?v=N1rcKBHlw-Y)
- When using a model like k-NN, distances can be precomputed manually, circumventing scikit's "sequence or array?" checks altogether.
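As a rough sketch of the averaging option, assuming random data in place of real features: each matrix is collapsed along the time axis into one fixed-length vector. The helper name pool_frames is hypothetical, and appending the per-row standard deviation is my own addition rather than something from the video.

```python
import numpy as np


def pool_frames(matrix):
    """Collapse an (n_features, n_frames) matrix into a vector of length 2 * n_features."""
    # axis=1 runs over the columns, i.e. over time frames.
    return np.concatenate([matrix.mean(axis=1), matrix.std(axis=1)])


rng = np.random.default_rng(0)
# Hypothetical recordings: 13 features each, varying numbers of frames.
samples = [rng.normal(size=(13, y)) for y in (50, 72, 31)]

# Every recording now has the same length, so the rows stack into a regular 2D array.
X = np.vstack([pool_frames(m) for m in samples])
print(X.shape)
```

The resulting array has one row per recording and a fixed number of columns, which is the shape scikit-learn estimators expect. The obvious cost is that all ordering information between frames is thrown away.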
Of these three, I would prefer something similar to the first option, since it feels like the approach scikit is optimized for. Any ideas?
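For completeness, this is roughly how I understand the precomputed-distance option would look. It is only a sketch under assumptions: the data is synthetic, the helper names are hypothetical, and I picked a plain dynamic time warping (DTW) cost as the distance simply because it can compare sequences of different lengths. Any other sequence distance could be swapped in.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def dtw_distance(a, b):
    """Dynamic time warping cost between two (n_features, n_frames) matrices."""
    # Euclidean distance between every frame of a and every frame of b.
    frame_cost = np.linalg.norm(a[:, :, None] - b[:, None, :], axis=0)
    n_a, n_b = frame_cost.shape
    acc = np.full((n_a + 1, n_b + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_a + 1):
        for j in range(1, n_b + 1):
            acc[i, j] = frame_cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n_a, n_b]


def pairwise(rows, cols):
    """Distance matrix of shape (len(rows), len(cols))."""
    return np.array([[dtw_distance(r, c) for c in cols] for r in rows])


rng = np.random.default_rng(0)


def fake_recording(label, n_frames):
    # Synthetic stand-in for a 13 x n_frames feature matrix; classes differ in mean.
    return rng.normal(loc=10.0 * label, size=(13, n_frames))


train = [fake_recording(0, 20), fake_recording(1, 25),
         fake_recording(0, 22), fake_recording(1, 27)]
y_train = np.array([0, 1, 0, 1])
test = [fake_recording(0, 24), fake_recording(1, 21)]

# With metric="precomputed", fit takes an (n_train, n_train) distance matrix
# and predict takes an (n_test, n_train) one, so the ragged data never reaches scikit.
knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(pairwise(train, train), y_train)
predictions = knn.predict(pairwise(test, train))
print(predictions)
```

The double loop makes this quadratic in the number of frames per pair, so it would likely need a faster DTW implementation for a real dataset.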