
I have an audio dataset whose features are MFCCs, stored as a 1D numpy object array. There are 45K examples in total, so the file holds 1 column and 45K rows. Each row contains a nested array with N rows and 13 columns. Since the length of each example differs, I am trying to resize them all to the same length, so that they share the same number of rows and columns for further ML training. Here is how the data looks:

[image: the first example in the dataset, with 13 columns and 43 rows]

[image: the second example in the dataset, with 13 columns and 33 rows]
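
For context, the array can be loaded and inspected like this (the filename here is just a placeholder):

import numpy as np

# allow_pickle is needed because each row holds a nested array object
data = np.load("mfcc_features.npy", allow_pickle=True)  # placeholder filename

print(data.shape)     # (45000,) -- one row per example
print(data[0].shape)  # (43, 13) -- frames x MFCC coefficients
print(data[1].shape)  # (33, 13)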

I tried to use dynamic time warping, but all the code I found online only shows how to calculate the DTW distance between two audio examples:

import numpy as np
from scipy.spatial.distance import euclidean

from fastdtw import fastdtw

# two toy sequences of different lengths
x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([[2, 2], [3, 3], [4, 4]])

# fastdtw returns the DTW distance and the warping path that aligns x to y
distance, path = fastdtw(x, y, dist=euclidean)
print(distance)

But I don't really need to know the distance between two examples; instead, I need to know how to actually resize them so that they can be put into a symmetric matrix, so I don't even know whether the direction I am looking in is right. I have also tried looking into the Python tslearn library but had no luck finding anything useful. Thank you!
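
A minimal sketch of one possible "resize": linearly interpolating each example along the time axis to a fixed number of frames, using plain numpy. TARGET_LEN is an arbitrary placeholder (see the comments below on choosing a length), and data is the object array described above:

import numpy as np

def resample_frames(example, target_len):
    # Linearly interpolate an (n_frames, 13) MFCC matrix along the
    # time axis so that it ends up with exactly target_len frames.
    n_frames, n_coeffs = example.shape
    old_t = np.linspace(0.0, 1.0, n_frames)
    new_t = np.linspace(0.0, 1.0, target_len)
    return np.stack(
        [np.interp(new_t, old_t, example[:, c]) for c in range(n_coeffs)],
        axis=1,
    )

TARGET_LEN = 40  # placeholder choice, not prescribed anywhere
resized = np.stack([resample_frames(ex, TARGET_LEN) for ex in data])
# resized.shape -> (45000, TARGET_LEN, 13)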

vvx
  • In audio processing, it is quite common to define a maximal length for all examples and then cut longer ones and perform a padding operation on the shorter ones. Isn't that possible in your case? – BGraf May 23 '19 at 13:08
  • @BGraf, thank you for your comment! I did think about your suggestion, but in this dataset the minimum length is only 4 while the maximum length is 93, so it varies a lot and I wasn't sure what would be the best number to use. Is there a way to calculate this number? (My apologies if this is a simple question; it is my first time ever dealing with an audio dataset!) – vvx May 23 '19 at 13:19
  • That depends on the dataset and especially the task you want to solve. For example, if it is some kind of general classification and four timesteps carry enough information to classify the example, you could cut all examples to the minimum length of four timesteps. However, if it is a speech recognition task you will probably always need the whole example. In this case, you could pad all shorter examples to length 93. – BGraf May 23 '19 at 13:24
  • @BGraf I have just plotted the graph to explore the data, here is the plot: https://imgshare.io/images/2019/05/23/1558618214487.jpg It looks like 93 is more of an outlier than a common value, and most of the examples have 35 - 60 frames. Would it still be okay to use length 93, or would you recommend trying different numbers and seeing? thank you! – vvx May 23 '19 at 13:31
  • You need to think carefully about your problem and consider what @BGraf asked in the last comment. In certain applications you might be able simply to resample to a certain length; in others, just take a fixed length (e.g. 20, based on your graph); in others, you'd take a method that works with variable input length. – Lukasz Tracewski May 23 '19 at 14:06
  • @LukaszTracewski thank you! I have decided to use the full length of 93 to check the dataset. Just to confirm: I need to pad the missing frames of the shorter ones with silence, right? What would be a good library for this kind of operation? (See the padding sketch after these comments.) – vvx May 24 '19 at 09:14
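
Following the padding suggestion above, a minimal numpy sketch that pads shorter examples up to 93 frames and truncates longer ones. Note that zero vectors are a common placeholder here but are not literally the MFCCs of silence; for true silence padding, one would compute MFCC frames of a silent signal and tile those instead:

import numpy as np

TARGET_LEN = 93  # the maximum length observed in this dataset

def pad_or_truncate(example, target_len=TARGET_LEN):
    # Zero-pad an (n_frames, 13) MFCC matrix up to target_len frames,
    # or cut it down to target_len if it is longer.
    n_frames, n_coeffs = example.shape
    if n_frames >= target_len:
        return example[:target_len]
    pad = np.zeros((target_len - n_frames, n_coeffs), dtype=example.dtype)
    return np.vstack([example, pad])

fixed = np.stack([pad_or_truncate(ex) for ex in data])
# fixed.shape -> (45000, 93, 13)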

0 Answers