4

I'm trying to make TensorFlow's MFCC give me the same results as librosa's MFCC. I have tried to match all the default parameters that librosa uses in my TensorFlow code, but I still get a different result.

This is the TensorFlow code I have used:

import functools
import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

sample_rate = 16000

# audio_binary holds the raw WAV file contents (read elsewhere)
waveform = contrib_audio.decode_wav(
    audio_binary,
    desired_channels=1,
    desired_samples=sample_rate,
    name='decoded_sample_data')

transwav = tf.transpose(waveform[0])

stfts = tf.contrib.signal.stft(
    transwav,
    frame_length=2048,
    frame_step=512,
    fft_length=2048,
    window_fn=functools.partial(tf.contrib.signal.hann_window,
                                periodic=False),
    pad_end=True)

spectrograms = tf.abs(stfts)
num_spectrogram_bins = stfts.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0.0, 8000.0, 128
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
    upper_edge_hertz)
mel_spectrograms = tf.tensordot(
    spectrograms,
    linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
    linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :20]

The equivalent call in librosa:

libr_mfcc = librosa.feature.mfcc(wav, 16000)
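For reference, here is the same call with librosa's (0.6-era) defaults written out explicitly. This is my own expansion, not part of the original question, but it makes the mismatches with the TF graph above easier to see: librosa builds Slaney-style, area-normalized mel filters (unlike the HTK-style tf.contrib.signal.linear_to_mel_weight_matrix), converts to decibels via power_to_db rather than tf.log(mel + 1e-6), and frames with a centered, periodic Hann window rather than periodic=False with pad_end=True.

import librosa

libr_mfcc = librosa.feature.mfcc(
    y=wav, sr=16000,
    n_mfcc=20,          # the TF code also keeps the first 20 coefficients
    n_fft=2048,         # matches frame_length / fft_length above
    hop_length=512,     # matches frame_step above
    n_mels=128,         # matches num_mel_bins above
    fmin=0.0,
    fmax=8000.0,
    htk=False)          # Slaney mel scale, unlike linear_to_mel_weight_matrix (HTK)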

The following are plots of the results:

[plot: TensorFlow MFCC results]

[plot: librosa MFCC results]

Eli Leszczynski

5 Answers

6

I'm the author of tf.signal. Sorry for not seeing this post sooner, but you can get librosa and tf.signal.stft to match if you center-pad the signal before passing it to tf.signal.stft. See this GitHub issue for more details.
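A minimal sketch of the center-padding idea, using the parameters from the question (this is my own illustration of the answer, not rryan's code; see the linked GitHub issue for the authoritative version):

import tensorflow as tf

frame_length = 2048
frame_step = 512
signals = tf.placeholder(tf.float32, shape=[1, 16000])  # (batch, samples)

# librosa.stft(center=True) reflect-pads the signal by n_fft // 2 on each side;
# doing the same before tf.contrib.signal.stft lines the frames up with librosa's.
padded = tf.pad(signals,
                [[0, 0], [frame_length // 2, frame_length // 2]],
                mode='REFLECT')
stfts = tf.contrib.signal.stft(padded,
                               frame_length=frame_length,
                               frame_step=frame_step,
                               fft_length=frame_length)
# the default window_fn (a periodic Hann window) matches librosa's default window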

rryan
1

I spent a whole day trying to make them match. Even rryan's solution (center=False in librosa) didn't work for me, but I finally found out that the TF and librosa STFTs only match in the case win_length == n_fft in librosa and frame_length == fft_length in TF. That's why rryan's Colab example works, but if you set frame_length != fft_length, the amplitudes are very different (although visually, after plotting, the patterns look similar). A typical example: you choose some win_length/frame_length and then want to set n_fft/fft_length to the smallest power of 2 greater than win_length/frame_length; in that case the results will be different. So you need to stick with the inefficient FFT size given by your window size. I don't know why it is so, but that's how it is; hopefully this is helpful for someone.
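To illustrate the constraint (my own sketch, not code from this answer): the two calls below use the same size for the analysis window and the FFT, which is the matching case. With frame_length != fft_length the magnitudes diverge; one plausible reason, not stated above, is that librosa centers the shorter window inside the n_fft frame while TF zero-pads each frame_length-sample frame at the end before the FFT.

import numpy as np
import librosa
import tensorflow as tf

wav = np.random.randn(16000).astype(np.float32)
n_fft = 1024  # win_length == n_fft (librosa)  <->  frame_length == fft_length (TF)

ros_stft = librosa.stft(wav, n_fft=n_fft, hop_length=256,
                        win_length=n_fft, center=False)        # (513, frames)

tf_stft = tf.contrib.signal.stft(tf.constant(wav[np.newaxis, :]),
                                 frame_length=n_fft, frame_step=256,
                                 fft_length=n_fft)              # (1, frames, 513)
with tf.Session() as sess:
    tf_stft_val = sess.run(tf_stft)

# maximum magnitude difference should be small in this matched case
print(np.max(np.abs(np.abs(ros_stft.T) - np.abs(tf_stft_val[0]))))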

0

The output of contrib_audio.decode_wav should be a DecodeWav tuple of { audio, sample_rate }, where audio has shape (sample_rate, 1), so what is the purpose of taking the first item of waveform and doing a transpose?

transwav = tf.transpose(waveform[0])
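For reference, a minimal sketch of what the decode op returns (my own illustration, assuming a mono, one-second, 16 kHz clip and a hypothetical file path):

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

audio_binary = tf.read_file('example.wav')  # hypothetical path
wav_data = contrib_audio.decode_wav(audio_binary,
                                    desired_channels=1,
                                    desired_samples=16000)
# wav_data.audio has shape (16000, 1); wav_data.sample_rate is a scalar tensor.
# tf.transpose(wav_data.audio) therefore has shape (1, 16000), i.e. a single-row
# "batch" that tf.contrib.signal.stft can frame along its last axis.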

0

There is no straightforward way, since librosa's stft uses center=True, which does not comply with TF's stft. Had it been center=False, the TF/librosa STFTs would give near-enough results; see the Colab snippet.

But even so, trying to port the librosa code into TF is a big headache. Here is what I started and gave up on. Near, but not near enough.

import numpy as np
import librosa
import tensorflow as tf


def pow2db_tf(X):
    # TF port of librosa.power_to_db with ref=1.0, amin=1e-10, top_db=80.0
    amin = 1e-10
    top_db = 80.0
    ref_value = 1.0
    log10 = 2.302585092994046
    log_spec = (10.0 / log10) * tf.log(tf.maximum(amin, X))
    log_spec -= (10.0 / log10) * tf.log(tf.maximum(amin, ref_value))
    pow2db = tf.maximum(log_spec, tf.reduce_max(log_spec) - top_db)
    return pow2db


def librosa_feature_like_tf(x, sr=16000, n_fft=2048, hop_length=512, n_mfcc=20):
    # mel filterbank built by librosa, transposed to (1 + n_fft // 2, n_mels)
    # so it can be applied to the last axis of the spectrogram
    mel_basis = librosa.filters.mel(sr, n_fft).astype(np.float32).T
    tf_stft = tf.contrib.signal.stft(x, frame_length=n_fft,
                                     frame_step=hop_length, fft_length=n_fft)
    print("tf_stft", tf_stft.shape)
    # note: librosa's melspectrogram uses the power spectrum (|STFT|**2), while
    # this uses the magnitude, which is one of the remaining mismatches
    tf_S = tf.tensordot(tf.abs(tf_stft), mel_basis, 1)
    print("tf_S", tf_S.shape)
    tfdct = tf.spectral.dct(pow2db_tf(tf_S), norm='ortho')
    print("tfdct before cut", tfdct.shape)
    tfdct = tfdct[:, :, :n_mfcc]
    print("tfdct after cut", tfdct.shape)
    # tfdct = tf.transpose(tfdct, [0, 2, 1]); print("tfdct after transpose", tfdct.shape)
    return tfdct


sr = 16000
x = tf.placeholder(tf.float32, shape=[None, 16000], name='x')
tf_feature = librosa_feature_like_tf(x)
print("tf_feature", tf_feature.shape)
mfcc_rosa = librosa.feature.mfcc(wav, sr).T   # wav: the waveform loaded elsewhere
print("mfcc_rosa", mfcc_rosa.shape)
Ishay Tubi
0

For anyone still looking for this: I had a similar problem some time ago: matching librosa's mel filterbanks/mel spectrogram to a TensorFlow implementation. The solution was to use a different windowing approach for the spectrogram and to use librosa's mel matrix as a constant tensor. See here and here.
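A rough sketch of the constant-mel-matrix part of that idea (my own reconstruction under stated assumptions, not the code from the linked posts): build the filterbank once with librosa, freeze it as a constant, and apply it to a TF power spectrogram so at least the mel stage matches librosa exactly. The windowing/centering differences discussed in the other answers still need separate handling.

import numpy as np
import librosa
import tensorflow as tf

sr, n_fft, hop, n_mels = 16000, 2048, 512, 128

# librosa's (Slaney-style, area-normalized) mel matrix, shape (n_mels, 1 + n_fft // 2)
mel_np = librosa.filters.mel(sr, n_fft, n_mels=n_mels).astype(np.float32)
mel_const = tf.constant(mel_np.T)              # (1 + n_fft // 2, n_mels)

signals = tf.placeholder(tf.float32, shape=[1, None])
stfts = tf.contrib.signal.stft(signals, frame_length=n_fft,
                               frame_step=hop, fft_length=n_fft)
power_spec = tf.abs(stfts) ** 2                # librosa's melspectrogram uses power
mel_spec = tf.tensordot(power_spec, mel_const, 1)   # (1, frames, n_mels)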

Daniel