I used the accepted answer's suggestion and created the following algorithm, which uses the Hilbert envelope and mutes parts of the song where there is noise but no vocals.
import numpy as np
import scipy as sp
import scipy.signal
import scipy.stats


def hilbert_metrics(signal):
    '''Calculates the amplitude envelope and the instantaneous frequency of the audio
    via the Hilbert transform and returns both.'''
    analytic_signal = sp.signal.hilbert(signal)
    amplitude_envelope = np.abs(analytic_signal)
    instantaneous_phase = np.unwrap(np.angle(analytic_signal))
    # assumes a 44.1 kHz sample rate
    instantaneous_frequency = (np.diff(instantaneous_phase) /
                               (2.0 * np.pi) * 44100)
    instantaneous_frequency += np.max(instantaneous_frequency)
    return amplitude_envelope, instantaneous_frequency
def denoise(wav_file_handler, hop_length: int = 1024, window_length_in_second: float = 0.5,
            threshold_softness: float = 4.0, stat_mode: str = "mean", verbose: int = 0) -> np.ndarray:
    '''Runs a sliding window over the wav signal.
    For each window it checks the previous segment and the next segment: if both have an
    amplitude statistic lower than (overall amplitude statistic / threshold_softness),
    those areas are probably only noise, so the middle segment is silenced as well.
    This is effective because it looks at the local area when searching for noise.
    If the neighbouring segments are above the threshold, the window is probably an
    actual part of the song.
    The lower the threshold_softness, the more aggressive the noise detection becomes.'''
    stat_mode = stat_mode.lower()
    assert stat_mode in ["median", "mean", "mode"], \
        f"expected 'mean', 'median' or 'mode' for `stat_mode` but received: '{stat_mode}'"

    def amps_reducer_function(amps):
        if stat_mode == "median":
            return np.median(amps)
        elif stat_mode == "mean":
            return np.mean(amps)
        elif stat_mode == "mode":
            # scipy.stats.mode returns a ModeResult; take the mode value itself
            return sp.stats.mode(amps).mode

    wav = np.copy(wav_file_handler.wav_file)
    amp, freq = hilbert_metrics(wav)
    window_length_frames = int(window_length_in_second * wav_file_handler.sample_rate)
    # global amplitude statistic; segments well below it are treated as noise
    amp_metric = amps_reducer_function(amp)
    threshold = amp_metric / threshold_softness
    muted_segments_count = 0
    for i in range(window_length_frames, len(wav) - window_length_frames, hop_length):
        previous_segment_stat = amps_reducer_function(amp[i - window_length_frames: i])
        next_segment_stat = amps_reducer_function(amp[i + window_length_frames: i + window_length_frames * 2])
        if previous_segment_stat < threshold and next_segment_stat < threshold:
            if verbose:
                print(f"previous segment stat: {previous_segment_stat}, threshold: {threshold}, "
                      f"next_segment_stat: {next_segment_stat}")
            muted_segments_count += 1
            # silence the current window in the copied wav (the envelope itself is left untouched)
            wav[i: i + window_length_frames] = 0.0
    if verbose:
        print(f"Denoising completed! muted {muted_segments_count} segments")
    return wav
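If you want to try it, something roughly like the following works. The wrapper class here is only a sketch and the file names are placeholders; the real handler object just needs to expose a wav_file array and a sample_rate attribute:

import numpy as np
import scipy.io.wavfile

class SimpleWavHandler:
    '''Minimal stand-in for wav_file_handler: only exposes wav_file and sample_rate.'''
    def __init__(self, path):
        self.sample_rate, data = scipy.io.wavfile.read(path)
        if data.ndim > 1:
            data = data.mean(axis=1)  # mix stereo down to mono
        self.wav_file = data.astype(np.float64)

handler = SimpleWavHandler("song.wav")  # placeholder file name
denoised = denoise(handler, verbose=1)
scipy.io.wavfile.write("song_denoised.wav", handler.sample_rate, denoised.astype(np.float32))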
This method can definitely be improved by using a different threshold, or even by applying low-pass and high-pass filters to remove unwanted frequencies as well.
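As a rough sketch of that band-pass idea (the 80 Hz and 8 kHz cutoffs below are only illustrative values, not tuned for vocals), you could filter the signal before computing the envelope:

import scipy.signal

def bandpass(wav, sample_rate, low_hz=80.0, high_hz=8000.0, order=4):
    # Butterworth band-pass applied forward and backward for zero phase distortion
    sos = scipy.signal.butter(order, [low_hz, high_hz], btype="bandpass",
                              fs=sample_rate, output="sos")
    return scipy.signal.sosfiltfilt(sos, wav)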
Here is an example of running the method on a wav signal, where you can see the denoising effect:
This is the original signal:

This is the denoised signal using the default parameters:

This is the same signal denoised with threshold_softness = 2 instead of 4:

This is the same denoising algorithm as the previous one, but using np.median instead of np.mean, which makes the method run much faster and gives a similar result:

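For reference, the comparisons above correspond to calls roughly like these (using the same handler object as in the usage sketch earlier):

denoised_default = denoise(handler)                                            # default parameters
denoised_strict = denoise(handler, threshold_softness=2)                       # more aggressive muting
denoised_median = denoise(handler, threshold_softness=2, stat_mode="median")   # median instead of mean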