
Given a WAV file (mono, 16 kHz sampling rate) of a recording of a human talking, is there a way to extract just the voice, thereby filtering out most mechanical and background noise? I'm trying to use the librosa package in Python 3.6 for this, but I can't figure out how piptrack works (or whether there is a simpler way).

When I tried using an FFT/IFFT to restrict frequencies to the 300-3400 Hz range, the resulting sound was severely distorted.

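import numpy as np
import scipy.io.wavfile

sr, y = scipy.io.wavfile.read(wav_file_path)  # returns (sample rate, samples)
x = np.fft.rfft(y)[0:3400]  # keep the first 3400 FFT bins
x[0:300] = 0  # zero the first 300 bins
x = np.fft.irfft(x)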
Oleg Melnikov

1 Answer


Extracting the human voice from an audio file is an actively researched problem. It's often referred to as 'Speech Enhancement' in the scientific literature. The latest developments in the field tend to be presented at the Interspeech and IEEE ICASSP conferences. You can also check out the Deep Noise Suppression Challenge from Microsoft.

The complexity of removing unwanted sound from a speech recording is highly dependent on the unwanted sound, and how much you know about it. If, as your attempt suggests, you are only interested in removing noise outside the 300-3400 Hz speech band, then you may be able to get some noise reduction with a properly designed band-pass filter (or a high-pass filter, if only low-frequency noise matters). Librosa has some filter implementations, and numpy/scipy will give you even more options.
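As a rough sketch of the filtering approach mentioned above, the following uses `scipy.signal` to design a Butterworth band-pass filter and apply it forwards and backwards (zero phase). The function name `bandpass_speech`, the filter order, and the use of the question's 300-3400 Hz range as cutoffs are assumptions for illustration, not a recommended configuration. This only attenuates energy outside the band; noise that overlaps the speech band is left untouched.

```python
import numpy as np
from scipy import signal


def bandpass_speech(y, sr, low_hz=300.0, high_hz=3400.0, order=4):
    """Zero-phase Butterworth band-pass (cutoffs and order are illustrative)."""
    # Second-order sections are numerically more robust than (b, a) coefficients.
    sos = signal.butter(order, [low_hz, high_hz], btype="bandpass",
                        fs=sr, output="sos")
    # Filter forwards and backwards so the result has no phase distortion.
    return signal.sosfiltfilt(sos, np.asarray(y, dtype=np.float64))


# Synthetic check: 50 Hz hum plus a 1 kHz tone, sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
hum = np.sin(2 * np.pi * 50 * t)
tone = np.sin(2 * np.pi * 1000 * t)
filtered = bandpass_speech(hum + tone, sr)
```

For a real recording, `y` and `sr` would come from `scipy.io.wavfile.read` as in the question; integer samples are converted to float before filtering, so the result may need rescaling before it is written back to a WAV file.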

Simply zeroing FFT coefficients will cause terrible distortion. See this Stack Overflow answer for why this is never a good idea.
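Separately from the zeroing issue, the slice indices in the question's snippet are FFT bin numbers, not frequencies in Hz, and truncating the spectrum changes the length of the signal that `irfft` returns. The sketch below illustrates both points with `np.fft.rfftfreq`; the helper `hz_to_bin` is a hypothetical name introduced here for illustration.

```python
import numpy as np


def hz_to_bin(freq_hz, n_samples, sr):
    """Index of the rfft bin closest to freq_hz (hypothetical helper)."""
    return int(round(freq_hz * n_samples / sr))


sr = 16000
for seconds in (1, 10):
    n = sr * seconds
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)  # centre frequency of each rfft bin
    # Bin spacing is sr / n, so bin 3400 is 3400 Hz only for a 1-second signal.
    print(seconds, len(freqs), freqs[3400], hz_to_bin(3400, n, sr))

# Slicing the spectrum to 3400 bins also shortens the reconstructed signal:
# irfft of m bins returns 2 * (m - 1) samples by default.
truncated = np.fft.irfft(np.zeros(3400, dtype=complex))
print(len(truncated))
```

For a 10-second recording at 16 kHz, bin 3400 sits at about 340 Hz, so the original slice discards most of the speech band and returns far fewer samples than the input.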

5Ke