How to fit a scikit model, for feature-vectors of varying lengths

Question

I'm working on a sound classification project: given a set of audio recordings, I try to determine which class a given recording falls into. You might compare this to a music genre or topic recognition (of a body of text) problem. My samples are of varying lengths, and I need to assign each sample exactly one label.

I represent my features as 2D matrices, where each column represents a frame in the audio file (e.g. 0.1 seconds) and each row is a feature pertaining solely to that time frame (e.g. MFCC coefficients). Although my row count is fixed, the number of columns varies with the length of the recording.

I feed in my training and testing data as numpy arrays; they contain a 2D n x y matrix for each sample, where n is a constant (e.g. 13) and y is variable, depending on the length of the current sample.

Unfortunately, scikit-learn doesn't seem to be a big fan of this, time and again raising ValueError: setting an array element with a sequence. I've seen a number of solutions:

1. Using one of the tools in sklearn.feature_extraction to vectorize features in a sequence (text, images of varying size, etc.). Most examples I've seen are for text-based problems, so I'm not entirely sure how applicable they are to an audio problem like this.
2. Taking the mean over the columns (i.e. over time) to produce a single time-independent feature vector (as can be seen here: https://www.youtube.com/watch?v=N1rcKBHlw-Y).
3. When using a model like k-NN, pre-computing the distances manually, circumventing scikit's "sequence or array?" checks altogether.

Now of these three, I would prefer something similar to #1, since it feels like this is the approach scikit is optimized for. Any ideas?
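For reference, the second option (collapsing the time axis) could look roughly like the sketch below. The data is synthetic, the helper name summarize is made up for illustration, and adding the standard deviation next to the mean is an assumption of this sketch rather than something taken from the linked video.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def summarize(sample):
    """Collapse an (n_features, n_frames) matrix into one fixed-length vector.

    Hypothetical helper: mean and standard deviation of every feature row,
    computed over the time axis (the columns).
    """
    return np.concatenate([sample.mean(axis=1), sample.std(axis=1)])


rng = np.random.default_rng(0)
n_features = 13

# Synthetic stand-in for MFCC matrices: 10 recordings per class,
# each with a different number of frames (columns).
samples = [
    rng.normal(loc=label, size=(n_features, rng.integers(20, 60)))
    for label in (0, 1)
    for _ in range(10)
]
labels = np.repeat([0, 1], 10)

# Every recording becomes one row of length 2 * n_features,
# so scikit-learn receives an ordinary rectangular 2D array.
X = np.stack([summarize(s) for s in samples])
clf = LogisticRegression().fit(X, labels)
```

This throws away all ordering information within a recording, which may or may not matter for the classes involved.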

score 3 · Answer 1 · answered Jul 16 '20 at 13:57

The standard way to deal with variable-length audio (or other time series) is to split it into a set of fixed-length analysis windows. Example code here. One can then merge the per-window predictions, for example by voting.
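As a rough illustration of the window-and-vote idea (not the example code linked above), here is a minimal sketch on synthetic data. The helper names split_windows and predict_recording, the window and hop sizes, and the choice of a random forest are all assumptions made for this example. It also assumes every recording has at least one full window of frames.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def split_windows(sample, window=10, hop=5):
    """Cut an (n_features, n_frames) matrix into flattened fixed-length windows.

    Hypothetical helper; trailing frames that do not fill a window are dropped.
    """
    n_frames = sample.shape[1]
    starts = range(0, n_frames - window + 1, hop)
    return np.stack([sample[:, s:s + window].ravel() for s in starts])


def predict_recording(clf, sample):
    """Classify every window of one recording and return the majority label."""
    votes = clf.predict(split_windows(sample))
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]


rng = np.random.default_rng(0)

# Synthetic stand-in for MFCC matrices with varying numbers of frames.
recordings = [
    rng.normal(loc=label, size=(13, rng.integers(20, 60)))
    for label in (0, 1)
    for _ in range(10)
]
labels = np.repeat([0, 1], 10)

# Each window inherits the label of the recording it came from.
X_parts = [split_windows(r) for r in recordings]
X = np.concatenate(X_parts)
y = np.concatenate([np.full(len(p), lab) for p, lab in zip(X_parts, labels)])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

new_recording = rng.normal(loc=1, size=(13, 37))
predicted = predict_recording(clf, new_recording)
```

When evaluating, windows from the same recording should stay on the same side of the train/test split, otherwise the score will look better than it is.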

Jon Nordby

bqbastos · Answer 2 · 2020-07-14T15:02:36.537

Padding is also an option when you want a fixed input shape. With padding, you append values (usually zeros) to the shorter examples so that they are the same size as the other examples.

There are a few padding strategies, e.g. pre-padding, where you prepend values to the beginning of the sequence, and post-padding, where you append values to the end of the sequence. The following link covers these padding strategies: Data Preparation for Variable Length Input Sequences. (TensorFlow provides a sequence-padding function: Tensorflow's pad_sequences.)

Padding is commonly used in Natural Language Processing (NLP) tasks, where encoded sequences (sentences) are of varying length. In "pad_sequence application in NLP task" you can find an example that uses TensorFlow's pad_sequences() to preprocess data in an NLP task.

In your case, one option is to add a preprocessing step before feeding the data to your model. In this step, you would transform your varying-length input data into fixed-length input data via the pad_sequences() function.
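If you would rather not pull in TensorFlow just for this, the same post-padding idea can be sketched with plain NumPy. This is a stand-in for pad_sequences(), not that function itself. The helper name pad_or_truncate and the cap of 30 frames (3 seconds at 0.1 s per frame) are assumptions for illustration.

```python
import numpy as np


def pad_or_truncate(sample, max_frames, value=0.0):
    """Post-pad (or cut) an (n_features, n_frames) matrix to max_frames columns.

    Hypothetical helper: shorter recordings get `value` appended on the right,
    longer ones lose the frames beyond max_frames.
    """
    n_frames = sample.shape[1]
    if n_frames >= max_frames:
        return sample[:, :max_frames]
    pad = max_frames - n_frames
    # Pad only the time axis (columns), leave the feature axis untouched.
    return np.pad(sample, ((0, 0), (0, pad)), constant_values=value)


rng = np.random.default_rng(0)

# Synthetic recordings with 12, 30 and 45 frames of 13 features each.
samples = [rng.normal(size=(13, n)) for n in (12, 30, 45)]
max_frames = 30  # assumed cap, e.g. 3 seconds at 0.1 s per frame

# Flatten each padded matrix so scikit-learn gets one fixed-length row
# per recording: shape (n_samples, 13 * max_frames).
X = np.stack([pad_or_truncate(s, max_frames).ravel() for s in samples])
```

Truncating at a fixed cap sidesteps the need to know the longest recording in advance, at the cost of discarding whatever comes after the cap.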

The thing with padding, though, is that you need to know the maximum length of a recording. I suppose it could work if I took it to be something like max. 3 seconds. Thanks for the suggestion! - madprogramer, Jul 15 '20 at 09:42
