
I'm trying to separate vocals from a song using a deep learning model. The output is not wrong, but some extra noises cause the signal to sound bad.

The following is 3 seconds of the output file where the noise exists (the areas with a rectangle are the noises):

[image: noise]

Link to the audio file

How can I remove these noises from my output file? I can see that these parts have a different amplitude than the parts of the song I want to keep. Is there a way to filter the signal based on these amplitudes and only allow a specific amplitude range to exist in my signal?

Thanks

UPDATE: Please look at the accepted answer and my code for the denoising algorithm that is working as expected!

Morez

2 Answers


"How can I remove these noises from my output file?"

You could 'window' it out: multiply the signal by a step function that is, e.g., 0.001 over the noisy regions and 1 over the regions you want to keep. This silences the noise and preserves your regions of interest. It does not generalise, however: the window is fixed, so it only works for a pre-specified audio segment.
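As a minimal sketch of that step-function window (not code from the question): the function name `apply_step_window`, the `(start, end)` times in seconds, and the synthetic test tone are all assumptions made for illustration. In practice the noisy regions would have to be marked by hand for each file.

```python
import numpy as np

def apply_step_window(signal, noisy_regions, sample_rate, noise_gain=0.001):
    """Multiply the signal by a step function: noise_gain inside the given
    (start_seconds, end_seconds) regions and 1 everywhere else."""
    gain = np.ones(len(signal), dtype=float)
    for start_s, end_s in noisy_regions:
        start = max(0, int(round(start_s * sample_rate)))
        end = min(len(signal), int(round(end_s * sample_rate)))
        gain[start:end] = noise_gain
    return signal * gain

# Hypothetical example: a 1-second tone with 0.25 s to 0.5 s marked as noise
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
gated = apply_step_window(tone, [(0.25, 0.5)], sample_rate)
```

A hard step like this can produce audible clicks at the region boundaries; ramping the gain over a few milliseconds is a common way to soften them.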

"I can see that these parts have a different amplitude than the other parts of the songs I want. Is there a way to filter the signal based on these amplitudes and only allow a specific amplitude range to exist in my signal?"

Here you could use one of two approaches: 1) a running window that calculates the energy (the sum of X^{2} over N samples, where X is your audio signal), or 2) the Hilbert envelope of your signal, smoothed with a window of an appropriate length (perhaps anywhere from 1 ms to hundreds of milliseconds). You can then set a threshold based on either the energy or the Hilbert envelope.
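Both approaches can be sketched roughly as follows. The helper names, the 20 ms window, the relative threshold of 0.1 x the maximum, and the synthetic signal are assumptions chosen for illustration, not values tuned for real vocals.

```python
import numpy as np
from scipy.signal import hilbert

def running_energy(x, n):
    """Sum of x**2 over a sliding window of n samples (same length as x)."""
    return np.convolve(np.asarray(x, dtype=float) ** 2, np.ones(n), mode="same")

def smoothed_envelope(x, n):
    """Hilbert envelope smoothed with a moving average of n samples."""
    envelope = np.abs(hilbert(x))
    return np.convolve(envelope, np.ones(n) / n, mode="same")

def gate(x, metric, threshold):
    """Keep samples whose metric reaches the threshold; zero the rest."""
    return np.where(metric >= threshold, x, 0.0)

# Synthetic test signal (assumption): a loud tone followed by low-level noise
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
x = np.where(t < 0.5,
             np.sin(2 * np.pi * 440 * t),
             0.01 * rng.standard_normal(t.size))

n = int(0.02 * sample_rate)              # 20 ms window (assumed)
energy = running_energy(x, n)
env = smoothed_envelope(x, n)
cleaned = gate(x, env, 0.1 * env.max())  # relative threshold (assumed)
```

The same `gate` call works with `energy` in place of `env`; only the scale of the threshold changes, since energy grows with the square of the amplitude.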

Thejasvi
  • Thank you for the answer. I've generated the Hilbert envelope as you mentioned. How can I find a threshold based on the envelope to remove the noise? – Morez Mar 08 '22 at 16:59
  • This will most likely be a matter of trial and error. If your signal is stereotypical and doesn't vary much you could set a fixed threshold value. Otherwise, you could choose a relative range e.g. 0.1 x max amplitude or something else which heuristically always silences the quieter parts. Please do upvote if you found my main answer useful. – Thejasvi Mar 09 '22 at 14:18
  • I updated the answer so that you can see my implementation of the denoising algorithm using the Hilbert envelope (as you mentioned). Thanks so much for the hints! – Morez Mar 09 '22 at 16:10
  • Hi @morez great that you shared the code. It'd be great if you posted a separate independent answer instead - this matches SO convention and reduces confusion for future users. (See https://meta.stackoverflow.com/questions/387912/can-should-i-edit-my-question-to-an-add-answer). It would also help your profile on SO as users can upvote your answer. – Thejasvi Mar 10 '22 at 21:31
  • Yes, I will do that. Thanks for the heads up – Morez Mar 11 '22 at 12:44
0

I followed the suggestion in the accepted answer and wrote the following algorithm, which uses the Hilbert envelope to silence the parts of the song that contain noise but no vocals.

import numpy as np
import scipy as sp
import scipy.signal  # makes sp.signal available
import scipy.stats   # makes sp.stats available


def hilbert_metrics(signal, sample_rate=44100):
    '''Return the amplitude envelope and the instantaneous frequency of the audio.'''
    analytic_signal = sp.signal.hilbert(signal)
    amplitude_envelope = np.abs(analytic_signal)
    instantaneous_phase = np.unwrap(np.angle(analytic_signal))
    instantaneous_frequency = (np.diff(instantaneous_phase) /
                               (2.0*np.pi) * sample_rate)
    # Offset kept from the original code; denoise() does not use the frequency.
    instantaneous_frequency += np.max(instantaneous_frequency)
    return amplitude_envelope, instantaneous_frequency


def denoise(wav_file_handler, hop_length:int=1024, window_length_in_second:float=0.5, threshold_softness:float=4.0, stat_mode="mean", verbose:int=0)->np.ndarray:
    '''Slide a window over the wav signal and mute the segments that look like noise.
    `wav_file_handler` must expose `wav_file` (the samples) and `sample_rate`.
    For each segment, the previous and the next segment are checked: if both have an
    amplitude statistic below (overall statistic / threshold_softness), the area is
    probably only noise, so the middle segment is silenced as well.
    If they are above that threshold, the segment is probably an actual part of the song.
    Looking at the local area is what makes the noise search effective.
    The lower the threshold_softness, the more extreme the noise detection becomes.'''
    stat_mode = stat_mode.lower()
    assert stat_mode in ["median", "mean", "mode"], f"expected 'mean', 'median' or 'mode' for `stat_mode` but received: '{stat_mode}'"

    def amps_reducer_function(amps):
        if stat_mode == "median":
            return np.median(amps)
        elif stat_mode == "mean":
            return np.mean(amps)
        elif stat_mode == "mode":
            # sp.stats.mode returns a result object; extract the scalar mode from it
            return np.ravel(sp.stats.mode(amps, axis=None).mode)[0]

    wav = np.copy(wav_file_handler.wav_file)
    amp, freq = hilbert_metrics(wav, wav_file_handler.sample_rate)
    window_length_frames = int(window_length_in_second*wav_file_handler.sample_rate)
    amp_metric = amps_reducer_function(amp)
    threshold = amp_metric/threshold_softness
    muted_segments_count = 0
    for i in range(window_length_frames, len(wav)-window_length_frames, hop_length):
        previous_segment_stat = amps_reducer_function(amp[i-window_length_frames: i])
        next_segment_stat = amps_reducer_function(amp[i+window_length_frames: i+window_length_frames*2])
        if previous_segment_stat < threshold and next_segment_stat < threshold:
            if verbose: print(f"previous segment stat: {previous_segment_stat}, threshold: {threshold}, next_segment_stat: {next_segment_stat} ")
            muted_segments_count += 1
            # Zero the envelope as well as the audio, as the original in-place
            # multiplication did, so later windows see this segment as silent.
            amp[i: i+window_length_frames] = 0.0
            wav[i: i+window_length_frames] = 0.0
    if verbose: print(f"Denoising completed! muted {muted_segments_count} segments")
    return wav

This method can definitely be improved, for example by using a different threshold or by adding low-pass and high-pass filters to remove unwanted frequencies as well.
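One way the filtering idea could look is a Butterworth band-pass, which combines a high-pass and a low-pass stage. This is only a sketch: the function name `bandpass`, the 80 Hz and 8000 Hz cutoffs, and the filter order are assumptions for illustration, not values taken from the algorithm above.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, sample_rate, low_hz=80.0, high_hz=8000.0, order=4):
    """Zero-phase Butterworth band-pass; the cutoffs are assumed example values."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfiltfilt(sos, signal)

# Hypothetical example: 20 Hz rumble mixed with a 1 kHz tone
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
rumble = np.sin(2 * np.pi * 20 * t)
voice_like = np.sin(2 * np.pi * 1000 * t)
filtered = bandpass(rumble + voice_like, sample_rate)
```

Such a filter could be applied before or after the amplitude-based muting; which order works better would need to be checked by ear on real recordings.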

Here is an example of running the method on a wav signal, where you can see the denoising effect:

This is the original signal: [image: Original signal]

This is the denoised signal using the default parameters: [image: Denoised signal]

This is the same signal denoised with threshold_softness = 2 instead of 4: [image: More extreme denoising]

This is the same denoising algorithm as the previous one, but using np.median instead of np.mean, which makes the method run much faster and gives a similar result: [image: Median]

Morez