5

I recently do my homework about MFCC, and I can't figure out some differences between using these libraries.

The 3 libraries I use are:

python_speech_features

SpeechPy

LibROSA

samplerate = 16000
NFFT = 512
NCEPT = 13

1st Part: Mel filter bank

temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy do not divide 2 and add 1 when initializing
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# fix librosa normalized version
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]

pic1

Only the shape in speechpy will get (, 512), other all (, 257). The figure of librosa is a bit of deformation.

2nd Part: MFCC

# pyspeech without lifter. Using hamming
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
                           preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy need pre-emphasized. Using rectangular window fixed. Mel filter bank is not the same
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
                                   num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa need pre-emphasized. Using log energy. Its STFT using hanning, but its framing is not the same
temp3_energy = librosa.feature.melspectrogram(emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
                                          hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
                                  hop_length=frame_step).T

pic2

I've tried my best to set the condition faire. The figure of speechpy gets darker.

3rd Part: Delta coefficient

temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa along the frame axis
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)

pic3

I can't directly set mfcc as argument in speechpy, or it will be very strange. And what these parameters originally act is not the same as my expected.

I'm wondering what factors to make these differences. Is it just somethong I mentioned above? Or I made some mistakes? Hope for details, thanks.

Bill Sun
  • 51
  • 1
  • 3

1 Answers1

2

There are many MFCC implementations and they often differ bit by bit - window function shape, mel filterbank calculation, dct could be different too. It is hard to find a fully compatible library. In the long term it should not matter for you as long as you use the same implementation everywhere. The differences do not affect results.

Nikolay Shmyrev
  • 24,897
  • 5
  • 43
  • 87