
I am extracting MFCC features from .mp3 voice files, but I want to keep the source files unchanged and avoid creating any new files. My processing includes the following steps:

  • Load .mp3 file, eliminate silence, and generate .wav data using pydub
  • Read audio data and rate using scipy.io.wavfile.read()
  • Extract features using python_speech_features

However, eliminate_silence() returns an AudioSegment object, whereas scipy.io.wavfile.read() expects a .wav filename, so I am forced to temporarily save/export the data as a wave file to bridge the two. This step is memory- and time-consuming, so my question is: how can I avoid the wave file export step? Or is there a workaround for it?

Here is my code.

import os
from pydub import AudioSegment
from scipy.io.wavfile import read
from sklearn import preprocessing
from python_speech_features import mfcc
from pydub.silence import split_on_silence

def eliminate_silence(input_path):
    """ Eliminate silent chunks from original call recording """
    # Load the input mp3 file
    sound = AudioSegment.from_mp3(input_path)
    chunks = split_on_silence(sound,
                              # split on silences longer than 500 ms
                              min_silence_len=500,
                              # anything under -30 dBFS is considered silence
                              silence_thresh=-30,
                              # keep 100 ms of leading/trailing silence
                              keep_silence=100)

    output_chunks = AudioSegment.empty()
    for chunk in chunks:
        output_chunks += chunk
    return output_chunks


silence_clear_data = eliminate_silence("file.mp3")
silence_clear_data.export("temp.wav", format="wav")
rate, audio = read("temp.wav")
os.remove("temp.wav")

# Extract MFCCs
mfcc_feature = mfcc(audio, rate, winlen=0.025, winstep=0.01, numcep=15,
                    nfilt=35, nfft=512, appendEnergy=True)
mfcc_feature = preprocessing.scale(mfcc_feature)
SuperKogito

2 Answers


I'm currently working on a project where I split audio on silences and compute MFCC coefficients, so I'll leave my solution:

import pydub
import python_speech_features as p
import numpy as np

def generate_mfcc_without_silences(path):
    # load the audio and change the frame rate to 16 kHz
    audio_file = pydub.AudioSegment.from_wav(path)
    audio_file = audio_file.set_frame_rate(16000)
    # split the audio at silences
    chunks = pydub.silence.split_on_silence(audio_file, silence_thresh=audio_file.dBFS, min_silence_len=200)
    mfccs = []
    for chunk in chunks:
        # compute MFCCs from the chunk's sample array
        np_chunk = np.frombuffer(chunk.get_array_of_samples(), dtype=np.int16)
        mfccs.append(p.mfcc(np_chunk, samplerate=audio_file.frame_rate, numcep=26))
    return mfccs

Considerations:

  • I change the audio to 16 kHz, but this is optional.
  • I set min_silence_len to 200 because I want to try to capture single words.
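
For reference, a minimal usage sketch of the function above (the file name voice.wav is a hypothetical placeholder):

mfccs = generate_mfcc_without_silences("voice.wav")
# one MFCC matrix per non-silent chunk, each shaped (num_frames, 26)
for i, chunk_mfcc in enumerate(mfccs):
    print(i, chunk_mfcc.shape)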

Using the content of my function and your requirements, the function you need might be:

import pydub
import python_speech_features as p
import numpy as np
from sklearn import preprocessing

def mfcc_from_audio_without_silences(path):
    audio_file = pydub.AudioSegment.from_mp3(path)
    chunks = pydub.silence.split_on_silence(audio_file, silence_thresh=-30, min_silence_len=500, keep_silence=100)

    output_chunks = pydub.AudioSegment.empty()
    for chunk in chunks:
        output_chunks += chunk

    output_chunks = np.frombuffer(output_chunks.get_array_of_samples(), dtype=np.int16)
    mfcc_feature = p.mfcc(output_chunks, samplerate=audio_file.frame_rate, numcep=15, nfilt=35)
    return preprocessing.scale(mfcc_feature)
jlcdev
  • `np.frombuffer(output_chunks.get_array_of_samples(), dtype=np.int16)` this part particularly solved my issue of conversion. Thanks! – Furkanicus Jun 08 '19 at 15:38

Looks like AudioSegment.get_array_of_samples() is what you need. (You may need to construct a numpy array from that array before passing it to mfcc.)
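
A minimal sketch of that suggestion, assuming silence_clear_data is the AudioSegment returned by eliminate_silence() in the question and that the audio is 16-bit:

import numpy as np
from python_speech_features import mfcc

# wrap the AudioSegment's samples in a numpy array (this copies the data);
# for 16-bit audio, the samples are signed 16-bit integers
samples = np.array(silence_clear_data.get_array_of_samples())
mfcc_feature = mfcc(samples, silence_clear_data.frame_rate, numcep=15, nfilt=35)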

Martin Stone
  • Thank you for your answer, it was helpful. Accordingly, I changed `eliminate_silence()` to end with `return np.array(output_chunks.get_array_of_samples())`, and I pass the output directly to `mfcc()`. This works correctly, but surprisingly the code is slower. Do you have any idea why that is the case, and whether it is possible to accelerate it? – SuperKogito Aug 14 '18 at 13:24
  • I'm not sure why it would be slower. Presumably there are some inefficiencies in converting from one array type to another. Maybe specifying the data type when constructing the np.array would speed it up. Did you try passing the original array.array to mfcc()? Maybe it can handle that format directly. – Martin Stone Aug 14 '18 at 13:33
  • I think you are right, but I have no clue which inefficiencies are causing the delay. Specifying the data type did not make a notable difference, and passing the array directly to `mfcc()` resulted in a `TypeError: can't multiply sequence by non-int of type 'float'` – SuperKogito Aug 14 '18 at 14:22
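
A note on the slowdown discussed above: np.array() copies the sample data on every call, whereas np.frombuffer() (as used in the other answer) wraps the existing buffer without copying, so it may be the cheaper conversion. A minimal sketch, assuming 16-bit audio as in the question:

import numpy as np

# zero-copy view over the sample buffer (signed 16-bit PCM)
samples = np.frombuffer(output_chunks.get_array_of_samples(), dtype=np.int16)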