Librosa 0.8.0 | Vocal separation output works, but is sped up to 200% speed

Question

I am working on a Python script that separates the vocals from a track and writes them to an audio file. I chose librosa as the library for this. Here is the code:

import numpy as np
import librosa.display
import soundfile as sf
import matplotlib.pyplot as plt
import librosa

y, sr = librosa.load('testInput.mp3')
output_file_path = "testOutput.wav"

S_full, phase = librosa.magphase(librosa.stft(y))
S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)
margin_i, margin_v = 2, 10
power = 2

mask_i = librosa.util.softmask(S_filter,
                               margin_i * (S_full - S_filter),
                               power=power)

mask_v = librosa.util.softmask(S_full - S_filter,
                               margin_v * S_filter,
                               power=power)

# Once we have the masks, simply multiply them with the input spectrum
# to separate the components
S_foreground = mask_v * S_full
S_background = mask_i * S_full
D_foreground = S_foreground * phase
y_foreground = librosa.istft(D_foreground)
sf.write(output_file_path, y_foreground, samplerate=44100, subtype='PCM_24')

It works somewhat, but the output is sped up to 200%, which also pitches the voice up a lot. Whatever the input is, the output sounds like Alvin and the Chipmunks. Does anyone have an idea how to fix this, or what I am doing wrong?

Does nn_filter and numpy.minimum actually work OK for vocal separation? — Jon Nordby, Dec 21 '20 at 23:04
To be honest, very poorly. If you are aware of a better method I am very interested. — Bo Terham, Dec 23 '20 at 22:06
https://github.com/deezer/spleeter (2 stems)? Can be tested at https://splitter.ai/ — Jon Nordby, Dec 23 '20 at 23:31
@jonnor I tried to get spleeter to work. Possibly important extra context: I am writing this code for a school project, and the 'client' (just our teacher) does not want to install anything themselves. So installing it with conda was not going to work, and with plain Python I couldn't get ffmpeg to work with spleeter. Beyond that, I have no idea how to add it to the system variables automatically. TL;DR: librosa was the easy option. — Bo Terham, Dec 24 '20 at 01:00
If nothing can be installed on the user's (teacher's) side, then the best option is a web service or web page that they can use to test or demo it. — Jon Nordby, Dec 24 '20 at 12:40
My apologies, I should have been clearer. Our project is a kind of Spotify knock-off with extra functionality. We just finished our C# courses, so almost the entire application is in C#, except this script and the database parts. C# had no vocal separation libraries, which is why this Python script is the exception. However, the teacher only wants to install the main application and no other dependencies such as Python, ffmpeg or conda, unless we can install them automatically. @jonnor — Bo Terham, Dec 24 '20 at 12:52
Ok, then presumably it is Windows software. There are many tools to bundle Python in a Windows installer. In principle that is the same if you just include librosa, or if you include something like spleeter - though in practice the more complicated the software the more issues you will have — Jon Nordby, Dec 24 '20 at 12:54
Anyway, *good* vocal separation needs something as complicated as spleeter. With trained neural nets etc — Jon Nordby, Dec 24 '20 at 12:54

score 1 · Answer 1 · answered Dec 21 '20 at 23:01


librosa.load resamples to 22050 Hz by default. To preserve the original sample rate of the input, use librosa.load(..., sr=None). Note, however, that many default parameters in librosa are tuned for 22050 Hz, such as FFT lengths. In this example that applies at least to the stft and istft calls.

So you may also want to try keeping the default 22050 Hz sample rate. In either case, pass samplerate=sr in your call to sf.write() instead of hardcoding the value.


Jon Nordby


You're right. This was the working solution (I forgot to report back here). An interesting note for people who stumble on this thread in the future: librosa's example code sets the sample rate of the output to 44100, which is odd, because that is exactly what caused the doubled speed. It should be set to the sample rate returned by the load method. Thanks for the help! – Bo Terham Dec 23 '20 at 22:10
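The factor of two can be reproduced without librosa. The sketch below uses only the Python standard library; write_tone and playback_seconds are hypothetical helpers made up for this illustration. It writes one second of samples generated at 22050 Hz twice, once with a matching header rate and once with 44100 declared, then reads back the duration a player would see.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_tone(path, data_rate, header_rate, seconds=1.0, freq=440.0):
    """Write a mono 16-bit sine sampled at data_rate, declaring header_rate in the file."""
    n_samples = int(data_rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(16000 * math.sin(2 * math.pi * freq * i / data_rate)))
        for i in range(n_samples)
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(header_rate)
        wav.writeframes(frames)


def playback_seconds(path):
    """Duration implied by the header: number of frames divided by declared rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


with tempfile.TemporaryDirectory() as tmp:
    matched = Path(tmp) / "matched.wav"
    mismatched = Path(tmp) / "mismatched.wav"
    write_tone(matched, data_rate=22050, header_rate=22050)
    write_tone(mismatched, data_rate=22050, header_rate=44100)
    matched_seconds = playback_seconds(matched)
    mismatched_seconds = playback_seconds(mismatched)

print(matched_seconds, mismatched_seconds)  # prints "1.0 0.5"
```

The same 22050 samples last half as long when the header claims 44100 Hz, which is the doubled speed and raised pitch described in the question.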
