How to loop through multiple audio files to extract MFCC features to feed to CNN model?

Question

I need some help with MFCC feature extraction using librosa. My goal is to calculate MFCCs from 160 audio files and use the output to train a convolutional neural network, but I'm having some issues with the code. I'm primarily a C++ user, so Python is still tripping me up a bit.

My project is classifying songs by time period (the dataset is songs from the 1950s to now). My goal is to extract the features and then train a CNN to predict the time period a song is from.

To start, I used glob to create a list of 160 mp3 files and saved it in the audio_files variable. Then I looped through audio_files, loaded each mp3 file using librosa.load, and calculated the MFCC. The issue is that whenever I stop the loop before it finishes and try to print out the mfcc1 variable, it only outputs the last MFCC matrix it calculated. I need it to save the MFCC data for every mp3 file it loops through. In the end, I want to feed this data to the CNN model I'm building. Any tips on this? Thank you.

I saw that some people use a CSV file that includes information about each audio file and loop over its rows. Is this necessary? I see that you can make a dataframe that way, with the MFCC data for every file all in one place.

Also, I know that mfcc1 is of type numpy.ndarray. Will this format work for the CNN? Thank you all again.

import librosa
import librosa.display
import IPython.display as ipd
import matplotlib.pylab as plt
import os
from glob import glob
%matplotlib inline
from sklearn.preprocessing import minmax_scale

audio_files = glob('/Users/wt56/Downloads/522 Dataset/*.mp3')

for files in audio_files:
    print(files)

for file in audio_files:
    x, sr = librosa.load(file)

    mfcc1 = librosa.feature.mfcc(y=x, sr=sr)
    print(mfcc1)

You can use a dictionary ( mfcc1 = {} ) to store all the MFCC arrays. However, for classification, you really need to divide your audio tracks into smaller segments (ideally 1-2 seconds long) and then perform an MFCC analysis on each of those segments. Trying to extract a single MFCC array per song is almost certainly not going to work. — dsp_user, Apr 05 '23 at 12:58
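A minimal sketch of the dictionary idea, assuming the goal is simply to keep one MFCC matrix per file instead of overwriting a single variable on every iteration. The helper names load_mfcc and extract_all, the n_mfcc value, and the dataset path in the usage comment are illustrative assumptions, not part of the original code.

```python
def load_mfcc(path, n_mfcc=13):
    """Load one audio file and return its MFCC matrix, shape (n_mfcc, frames)."""
    import librosa  # imported here so extract_all can be tried without librosa
    y, sr = librosa.load(path)  # resamples to 22050 Hz by default
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)


def extract_all(paths, extract=load_mfcc):
    """Return {path: features}, storing every result under its own key."""
    features = {}
    for path in paths:
        features[path] = extract(path)
    return features


# Usage (the directory is a placeholder):
# from glob import glob
# mfccs = extract_all(glob('/path/to/dataset/*.mp3'))
# print(len(mfccs))  # one entry per file
```

Keying by file path keeps each matrix traceable to its source file. A list of (path, matrix) pairs would work just as well if ordering matters more than lookup.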
@dsp_user I hadn't even thought of only getting the mfcc of a segment (couple seconds) of an audio file. I will do that, and I imagine it will take up less overhead to run it as well. Thank you for your help. Dictionary is a great idea. — wangowango, Apr 05 '23 at 22:16
@dsp_user Just a clarification... when you said to run the mfcc analysis on each of the broken up segments of the audio files, do you mean the mfcc is calculated on 2 seconds of the entire file, or I need to do mfcc on each 2 second long segment for the length of the whole audio file? i.e. if a song is 30 seconds, do I need to run the mfcc 15 times (in 2-second long segments) or just one 2 second segment from the entire audio track? Thank you — wangowango, Apr 05 '23 at 22:24
You should do the latter, i.e. run the MFCC analysis on every 1-2 second segment of an audio file. Note that the actual number of MFCC arrays is much greater than the number of segments per song. The reason is that, in order to compute the MFCC, the STFT (short-time Fourier transform) must be calculated first (librosa.feature.mfcc already does that). However, since STFTs are usually calculated on frames of 1024 or 2048 samples — dsp_user, Apr 06 '23 at 07:31
using overlapping sliding windows where the hop length is 256 or 512 samples respectively, the total number of MFCC arrays per song can be calculated as number_segments_per_song * (single_segment_samples_length / frame_samples_length) * (frame_samples_length / hop_samples_length). An example of this process can be found here https://github.com/musikalkemist/DeepLearningForAudioWithPython/blob/master/12-%20Music%20genre%20classification:%20Preparing%20the%20dataset/code/extract_data.py — dsp_user, Apr 06 '23 at 07:43
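As a rough worked example of that count, assuming 30-second tracks, 2-second segments, librosa's default 22050 Hz sampling rate and a hop length of 512 (all illustrative values): the formula simplifies to number_segments_per_song * single_segment_samples_length / hop_samples_length, which is about 15 * 44100 / 512 ≈ 1292. With librosa's default centered framing, a signal of n samples yields 1 + n // hop_length frames, so the exact figure comes out slightly higher. The helper segment_bounds is a hypothetical name.

```python
SAMPLE_RATE = 22050    # librosa.load's default sampling rate
HOP_LENGTH = 512       # hop between successive STFT frames, in samples
SEGMENT_SECONDS = 2    # assumed segment length
TRACK_SECONDS = 30     # assumed track length


def segment_bounds(num_samples, samples_per_segment):
    """Return (start, end) sample indices of each full segment; a short tail is dropped."""
    count = num_samples // samples_per_segment
    return [(i * samples_per_segment, (i + 1) * samples_per_segment)
            for i in range(count)]


samples_per_segment = SAMPLE_RATE * SEGMENT_SECONDS
bounds = segment_bounds(SAMPLE_RATE * TRACK_SECONDS, samples_per_segment)

# Centered framing: a signal of n samples gives 1 + n // hop_length frames.
frames_per_segment = 1 + samples_per_segment // HOP_LENGTH
frames_per_track = len(bounds) * frames_per_segment

print(len(bounds), frames_per_segment, frames_per_track)  # 15 87 1305
```

Each (start, end) pair can be used to slice the loaded signal, e.g. x[start:end], before passing the slice to librosa.feature.mfcc.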
@dsp_user Hi, thank you so much for your help. I’ve been doing more research because I was completely lost on how to do what you were suggesting, but now it’s starting to make sense a little bit. To continue, I think my next step is to make sure all my audio files are the same length so that I can decide how many segments I want per song. Also, all the examples that I’ve been seeing use WAV audio files, but I’m using MP3 files. Is there really a difference? Should I be using WAV files? — wangowango, Apr 12 '23 at 07:11
Sure, you can use MP3 files. In fact, any file format that librosa is able to load will do. And yes, your tracks should be the same length, which will make further processing much easier. The GitHub project I've linked to is actually a good starting place because it demonstrates how to 1) extract the MFCCs from a bunch of audio tracks and save them to a (MFCC features) file and 2) load the MFCCs from the file and feed them into a neural network. — dsp_user, Apr 12 '23 at 08:04
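One possible way to do the save-then-load step is sketched below, assuming the features are stored as JSON. The function names, the "mfcc" and "labels" keys, and the file name in the usage comment are illustrative assumptions. JSON cannot hold numpy arrays directly, so each MFCC matrix would need to be converted with .tolist() before saving and wrapped in numpy.array(...) again after loading.

```python
import json


def save_dataset(path, mfccs, labels):
    """Write MFCC matrices (as nested lists) and their integer labels to a JSON file."""
    with open(path, "w") as fp:
        json.dump({"mfcc": mfccs, "labels": labels}, fp)


def load_dataset(path):
    """Read the nested lists back; the caller converts them to arrays."""
    with open(path) as fp:
        data = json.load(fp)
    return data["mfcc"], data["labels"]


# Usage (file name is a placeholder; each segment is an MFCC array):
# save_dataset("data.json", [m.tolist() for m in segments], labels)
# mfccs, labels = load_dataset("data.json")
```

If every segment has the same number of frames, the loaded list converts to a single array of shape (num_segments, ...), which is the layout a CNN input pipeline typically expects once a channel axis is added.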
@dsp_user Amazing, thank you so much! The GitHub link is also very helpful, I've gotten many insights from it. — wangowango, Apr 12 '23 at 09:09
