Python | librosa: how to extract human voice from an audio wav file?

Question

Given a wav file (mono, 16 kHz sampling rate) of an audio recording of a human talking, is there a way to extract just the voice, thereby filtering out most mechanical and background noise? I'm trying to use the librosa package in Python 3.6 for this, but I can't figure out how piptrack works (or whether there is a simpler way).

When I tried using an FFT/IFFT to restrict frequencies to the 300-3400 Hz range, the resulting sound was severely distorted.

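import numpy as np
import scipy.io.wavfile

sr, y = scipy.io.wavfile.read(wav_file_path)
x = np.fft.rfft(y)[0:3400]  # keeps the first 3400 FFT bins
x[0:300] = 0                # zeroes the first 300 bins
x = np.fft.irfft(x)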

Can you please share some sample audio recordings? After that, I may be able to give you some direction on how to achieve this. - Abdul Basit, Sep 28 '20 at 07:13

score 0 · Answer 1 · answered Aug 25 '22 at 10:55

Extracting the human voice from an audio file is an actively researched problem. It's often referred to as 'Speech Enhancement' in the scientific literature. The latest developments in the field tend to be presented at the Interspeech and IEEE ICASSP conferences. You can also check out the Deep Noise Suppression Challenge from Microsoft.

The complexity of removing unwanted sound from a speech recording depends heavily on the unwanted sound and on how much you know about it. If, as your attempt suggests, you are only interested in removing noise outside the speech band (low-frequency noise in particular), then you may be able to get some noise reduction with a proper high-pass or band-pass filter. Librosa has some filter implementations, and numpy/scipy will give you even more options.
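As a rough sketch of the filtering approach, assuming the 300-3400 Hz band from the question and a 4th-order Butterworth design (both illustrative choices, not tuned values), a band-pass filter built with scipy.signal could look like this. The function name `bandpass_voice` is hypothetical.

```python
import numpy as np
from scipy import signal


def bandpass_voice(y, sr, low_hz=300.0, high_hz=3400.0, order=4):
    """Band-pass filter a mono signal; cutoffs and order are assumed values."""
    # Second-order sections are numerically more robust than (b, a) coefficients.
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass",
                        fs=sr, output="sos")
    # Forward-backward filtering gives zero phase shift.
    return signal.sosfiltfilt(sos, np.asarray(y, dtype=np.float64))
```

With the question's variables this would be called as `filtered = bandpass_voice(y, sr)`. Note that a filter like this only attenuates noise outside the pass band; noise that overlaps the speech frequencies is left untouched.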

Simply zeroing FFT coefficients will give terrible distortion. See this Stack Overflow answer for why this is never a good idea.
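Separately from the zeroing itself, the slice indices in the question's snippet are FFT bin numbers rather than frequencies in Hz, and truncating the spectrum before `irfft` changes the length of the reconstructed signal. A small sketch with a synthetic signal (the 3-second length and the variable names are made up for illustration) shows the mapping:

```python
import numpy as np

sr = 16000                       # sampling rate from the question
n = 3 * sr                       # assumed: a 3-second synthetic recording
y = np.random.default_rng(0).standard_normal(n)

spectrum = np.fft.rfft(y)
freqs = np.fft.rfftfreq(n, d=1.0 / sr)   # frequency in Hz of each rfft bin

# Bin k sits at k * sr / n Hz, so bin 3400 is about 1133 Hz here, not 3400 Hz.
hz_of_bin_3400 = freqs[3400]

# Slicing to 3400 bins makes irfft return 2 * (3400 - 1) samples,
# far fewer than the original n.
truncated = np.fft.irfft(spectrum[:3400])
```

Because the bin-to-Hz mapping depends on the recording length, the same hard-coded indices select a different band for every file.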
