4

I'm trying to make TensorFlow's MFCC give me the same results as librosa's MFCC. I have tried to match all the default parameters that librosa uses in my TensorFlow code, but I still get a different result.

This is the TensorFlow code I have used:

import functools
import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

sample_rate = 16000

# audio_binary holds the raw WAV file contents (read elsewhere)
waveform = contrib_audio.decode_wav(
    audio_binary,
    desired_channels=1,
    desired_samples=sample_rate,
    name='decoded_sample_data')

transwav = tf.transpose(waveform[0])

stfts = tf.contrib.signal.stft(
    transwav,
    frame_length=2048,
    frame_step=512,
    fft_length=2048,
    window_fn=functools.partial(tf.contrib.signal.hann_window,
                                periodic=False),
    pad_end=True)

spectrograms = tf.abs(stfts)
num_spectrogram_bins = stfts.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 0.0, 8000.0, 128
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
    upper_edge_hertz)
mel_spectrograms = tf.tensordot(
    spectrograms,
    linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
    linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :20]

The equivalent call in librosa:

libr_mfcc = librosa.feature.mfcc(wav, 16000)
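For reference, here is the same call with librosa's (0.6-era) defaults written out explicitly. This is my own expansion, not part of the original question, but it makes the mismatches with the TF graph above easier to see: librosa builds Slaney-style, area-normalized mel filters (unlike the HTK-style tf.contrib.signal.linear_to_mel_weight_matrix), converts to decibels via power_to_db rather than tf.log(mel + 1e-6), and frames with a centered, periodic Hann window rather than periodic=False with pad_end=True.

import librosa

libr_mfcc = librosa.feature.mfcc(
    y=wav, sr=16000,
    n_mfcc=20,          # the TF code also keeps the first 20 coefficients
    n_fft=2048,         # matches frame_length / fft_length above
    hop_length=512,     # matches frame_step above
    n_mels=128,         # matches num_mel_bins above
    fmin=0.0,
    fmax=8000.0,
    htk=False)          # Slaney mel scale, unlike linear_to_mel_weight_matrix (HTK)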

The following are plots of the results:

[plot: TensorFlow MFCC results]

[plot: librosa MFCC results]

Eli Leszczynski

5 Answers

6

I'm the author of tf.signal. Sorry for not seeing this post sooner, but you can get librosa and tf.signal.stft to match if you center-pad the signal before passing it to tf.signal.stft. See this GitHub issue for more details.
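A minimal sketch of the center-padding idea, using the parameters from the question (this is my own illustration of the answer, not rryan's code; see the linked GitHub issue for the authoritative version):

import tensorflow as tf

frame_length = 2048
frame_step = 512
signals = tf.placeholder(tf.float32, shape=[1, 16000])  # (batch, samples)

# librosa.stft(center=True) reflect-pads the signal by n_fft // 2 on each side;
# doing the same before tf.contrib.signal.stft lines the frames up with librosa's.
padded = tf.pad(signals,
                [[0, 0], [frame_length // 2, frame_length // 2]],
                mode='REFLECT')
stfts = tf.contrib.signal.stft(padded,
                               frame_length=frame_length,
                               frame_step=frame_step,
                               fft_length=frame_length)
# the default window_fn (a periodic Hann window) matches librosa's default window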

rryan
1

I spent a whole day trying to make them match. Even rryan's solution (center=False in librosa) didn't work for me, but I finally found out that the TF and librosa STFTs only match in the case win_length == n_fft in librosa and frame_length == fft_length in TF. That's why rryan's Colab example works, but if you set frame_length != fft_length, the amplitudes are very different (although visually, after plotting, the patterns look similar). A typical example: you choose some win_length/frame_length and then want to set n_fft/fft_length to the smallest power of 2 greater than win_length/frame_length; in that case the results will be different. So you need to stick with the inefficient FFT size given by your window size. I don't know why it is so, but that's how it is; hopefully this is helpful for someone.
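To illustrate the constraint (my own sketch, not code from this answer): the two calls below use the same size for the analysis window and the FFT, which is the matching case. With frame_length != fft_length the magnitudes diverge; one plausible reason, not stated above, is that librosa centers the shorter window inside the n_fft frame while TF zero-pads each frame_length-sample frame at the end before the FFT.

import numpy as np
import librosa
import tensorflow as tf

wav = np.random.randn(16000).astype(np.float32)
n_fft = 1024  # win_length == n_fft (librosa)  <->  frame_length == fft_length (TF)

ros_stft = librosa.stft(wav, n_fft=n_fft, hop_length=256,
                        win_length=n_fft, center=False)        # (513, frames)

tf_stft = tf.contrib.signal.stft(tf.constant(wav[np.newaxis, :]),
                                 frame_length=n_fft, frame_step=256,
                                 fft_length=n_fft)              # (1, frames, 513)
with tf.Session() as sess:
    tf_stft_val = sess.run(tf_stft)

# maximum magnitude difference should be small in this matched case
print(np.max(np.abs(np.abs(ros_stft.T) - np.abs(tf_stft_val[0]))))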

0

The output of contrib_audio.decode_wav should be a DecodeWav tuple of { audio, sample_rate }, where audio has shape (sample_rate, 1), so what is the purpose of taking the first item of waveform and doing a transpose?

transwav = tf.transpose(waveform[0])
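For reference, a minimal sketch of what the decode op returns (my own illustration, assuming a mono, one-second, 16 kHz clip and a hypothetical file path):

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

audio_binary = tf.read_file('example.wav')  # hypothetical path
wav_data = contrib_audio.decode_wav(audio_binary,
                                    desired_channels=1,
                                    desired_samples=16000)
# wav_data.audio has shape (16000, 1); wav_data.sample_rate is a scalar tensor.
# tf.transpose(wav_data.audio) therefore has shape (1, 16000), i.e. a single-row
# "batch" that tf.contrib.signal.stft can frame along its last axis.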

0

There is no straightforward way, since librosa's stft uses center=True, which does not comply with TF's stft. Had it been center=False, the TF/librosa STFTs would give near-enough results; see the Colab snippet.

But even so, trying to port the librosa code into TF is a big headache. Here is what I started and gave up on. Near, but not near enough.

import numpy as np
import librosa
import tensorflow as tf


def pow2db_tf(X):
    # TF port of librosa.power_to_db with ref=1.0, amin=1e-10, top_db=80.0
    amin = 1e-10
    top_db = 80.0
    ref_value = 1.0
    log10 = 2.302585092994046
    log_spec = (10.0 / log10) * tf.log(tf.maximum(amin, X))
    log_spec -= (10.0 / log10) * tf.log(tf.maximum(amin, ref_value))
    pow2db = tf.maximum(log_spec, tf.reduce_max(log_spec) - top_db)
    return pow2db


def librosa_feature_like_tf(x, sr=16000, n_fft=2048, hop_length=512, n_mfcc=20):
    # mel filterbank built by librosa, transposed to (1 + n_fft // 2, n_mels)
    # so it can be applied to the last axis of the spectrogram
    mel_basis = librosa.filters.mel(sr, n_fft).astype(np.float32).T
    tf_stft = tf.contrib.signal.stft(x, frame_length=n_fft,
                                     frame_step=hop_length, fft_length=n_fft)
    print("tf_stft", tf_stft.shape)
    # note: librosa's melspectrogram uses the power spectrum (|STFT|**2), while
    # this uses the magnitude, which is one of the remaining mismatches
    tf_S = tf.tensordot(tf.abs(tf_stft), mel_basis, 1)
    print("tf_S", tf_S.shape)
    tfdct = tf.spectral.dct(pow2db_tf(tf_S), norm='ortho')
    print("tfdct before cut", tfdct.shape)
    tfdct = tfdct[:, :, :n_mfcc]
    print("tfdct after cut", tfdct.shape)
    # tfdct = tf.transpose(tfdct, [0, 2, 1]); print("tfdct after transpose", tfdct.shape)
    return tfdct


sr = 16000
x = tf.placeholder(tf.float32, shape=[None, 16000], name='x')
tf_feature = librosa_feature_like_tf(x)
print("tf_feature", tf_feature.shape)
mfcc_rosa = librosa.feature.mfcc(wav, sr).T   # wav: the waveform loaded elsewhere
print("mfcc_rosa", mfcc_rosa.shape)
Ishay Tubi
0

For anyone still looking for this: I had a similar problem some time ago: matching librosa's mel filterbanks/mel spectrogram to a TensorFlow implementation. The solution was to use a different windowing approach for the spectrogram and to use librosa's mel matrix as a constant tensor. See here and here.
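A rough sketch of the constant-mel-matrix part of that idea (my own reconstruction under stated assumptions, not the code from the linked posts): build the filterbank once with librosa, freeze it as a constant, and apply it to a TF power spectrogram so at least the mel stage matches librosa exactly. The windowing/centering differences discussed in the other answers still need separate handling.

import numpy as np
import librosa
import tensorflow as tf

sr, n_fft, hop, n_mels = 16000, 2048, 512, 128

# librosa's (Slaney-style, area-normalized) mel matrix, shape (n_mels, 1 + n_fft // 2)
mel_np = librosa.filters.mel(sr, n_fft, n_mels=n_mels).astype(np.float32)
mel_const = tf.constant(mel_np.T)              # (1 + n_fft // 2, n_mels)

signals = tf.placeholder(tf.float32, shape=[1, None])
stfts = tf.contrib.signal.stft(signals, frame_length=n_fft,
                               frame_step=hop, fft_length=n_fft)
power_spec = tf.abs(stfts) ** 2                # librosa's melspectrogram uses power
mel_spec = tf.tensordot(power_spec, mel_const, 1)   # (1, frames, n_mels)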

Daniel