Reconstructing audio from a melspectrogram has some clipping with librosa

Question

I am doing:


    melspectrogram = librosa.feature.melspectrogram(
        y=samples, sr=sample_rate, window=scipy.signal.hanning, n_fft=n_fft, hop_length=hop_length)

    print('melspectrogram.shape', melspectrogram.shape)
    print(melspectrogram)

    audio_signal = librosa.feature.inverse.mel_to_audio(
        melspectrogram, sr=sample_rate, n_fft=n_fft, hop_length=hop_length, window=scipy.signal.hanning)
    print(audio_signal, audio_signal.shape)

    sf.write('test.wav', audio_signal, sample_rate)

The reconstructed WAV file sounds very similar to the original, but it has some slight clipping and audio artifacts. Is there a way to reconstruct the audio more faithfully?

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

As the documentation for `mel_to_audio` states:

This is primarily a convenience wrapper for:

    S = librosa.feature.inverse.mel_to_stft(M)
    y = librosa.griffinlim(S)

In other words, the generated Mel spectrogram is used to approximate the STFT magnitude. The STFT spectrogram is then converted back to the time domain using the Griffin-Lim algorithm.

The conversion from a Mel spectrogram to an STFT spectrogram is not lossless: the overlapping triangular filters used to build the Mel spectrogram merge neighboring frequency bins, so the original bins can only be approximated. The conversion from an STFT magnitude spectrogram to the time domain (i.e., to audio) is not perfect either, because the magnitude spectrogram lacks the phase information, which has to be estimated with the Griffin-Lim algorithm. This estimate is never exact and introduces phase artifacts (metallic "phasiness").

Skipping the Mel scale and simply using the STFT and inverse STFT leads to much better results. However, as soon as you start manipulating anything in the frequency domain before inversion, you will run into similar problems, though probably less severe than with the Mel spectrogram.

answered Feb 24 '20 at 11:34

Hendrik

How can I invert a regular `scipy.spectrogram`? – Shamoon Feb 24 '20 at 13:45
`y = librosa.griffinlim(S)` – Hendrik Feb 24 '20 at 13:51
Interesting - so I can apply the same algorithm? However, the shape is different, isn't it? A regular `scipy.spectrogram` returns `times, freqs, spec` – Shamoon Feb 24 '20 at 13:52
If you still have the phases, use `librosa.core.istft` and it will sound better—if you haven't manipulated it. – Hendrik Feb 24 '20 at 13:53
I'm doing an ML application, so I'm working with spectrogram information for my inputs (and outputs). So I don't think I have the phases, right? – Shamoon Feb 24 '20 at 13:54
Oh, wait. I didn't read your question closely enough. You can use Griffin-Lim to reconstruct any magnitude/power spectrogram without knowing the phase. The results are so-so. If you do have the phase, i.e. the complex values, use `librosa.core.istft`. When using `scipy.spectrogram` you get a triplet: `f, t, Sxx`. You should only need `Sxx`. How to reconstruct something from `Sxx` depends on the `mode` you chose when creating that spectrogram. – Hendrik Feb 24 '20 at 13:58
Typically, the "spectrogram" is either power or magnitude—so no phases. – Hendrik Feb 24 '20 at 14:00
I'm using the default `mode`, which looks like it's `psd`. So how would I specify that with `librosa.griffinlim`? I don't see a parameter for that – Shamoon Feb 24 '20 at 14:01
I created another question to better capture what I want: https://stackoverflow.com/questions/60377585/how-can-i-reverse-a-spicy-signal-spectrogram-to-audio-with-python – Shamoon Feb 24 '20 at 14:05
