
I am using the Librosa library for pitch and onset detection. Specifically, I am using onset_detect and piptrack.

This is my code:

import librosa
from scipy import signal

def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
  y = highpass_filter(y, sr)

  onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
  pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)

  notes = []

  for onset_frame in onset_frames:
    # Look a few frames past the onset, where the pitch has stabilized
    onset = onset_frame + onset_offset
    # Pick the frequency bin with the strongest magnitude in that frame
    index = magnitudes[:, onset].argmax()
    pitch = pitches[index, onset]
    if pitch != 0:
      notes.append(librosa.hz_to_note(pitch))

  return notes

def highpass_filter(y, sr):
  filter_stop_freq = 70  # Hz
  filter_pass_freq = 100  # Hz
  filter_order = 1001

  # High-pass filter
  nyquist_rate = sr / 2.
  desired = (0, 0, 1, 1)
  bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
  # (in newer SciPy versions, prefer fs=sr to the deprecated nyq keyword)
  filter_coefs = signal.firls(filter_order, bands, desired, nyq=nyquist_rate)

  # Apply high-pass filter
  filtered_audio = signal.filtfilt(filter_coefs, [1], y)
  return filtered_audio

When I run this on guitar samples recorded in a studio, i.e. samples without noise (like this), I get very good results from both functions. The onset times are correct and the frequencies are almost always correct (with occasional octave errors).

However, a big problem arises when I record my own guitar sounds with my cheap microphone. I get noisy audio files, such as this. The onset_detect algorithm gets confused and treats peaks in the noise as onset times, so I get very bad results: many onset times are detected even when the audio file contains a single note.

Here are two waveforms. The first is of a guitar sample of a B3 note recorded in a studio, whereas the second is my recording of an E2 note.

(waveform images: the studio-recorded B3, followed by my E2 recording)

The result for the first is correctly B3 (the one onset time was detected). The result for the second is an array of 7 elements, meaning 7 onset times were detected instead of 1! One of those elements is the correct onset time; the others are just random peaks in the noisy part.

Another example is this audio file containing the notes B3, C4, D4, E4:

(waveform image: the B3, C4, D4, E4 recording)

As you can see, the noise is clearly present and my high-pass filter has not helped (this is the waveform after applying the filter).

I assume this is a matter of noise, since that is where these files differ. If so, what could I do to reduce it? I have tried using a high-pass filter but there is no change.

  • Did you remove your *Cry Baby* before the test? – Casimir et Hippolyte May 16 '17 at 23:43
  • Your guitar isn't producing a sine wave. You're going to have harmonics... – Brad May 16 '17 at 23:44
  • I am expecting harmonics, but I am not sure why more than one onset time is detected. I think it is because of the unstable structure of the noisy note. – pavlos163 May 16 '17 at 23:49
  • @CasimiretHippolyte, I did not understand that. – pavlos163 May 16 '17 at 23:50
  • You will better understand with an example: https://www.youtube.com/watch?v=kMqGuF8VoRo – Casimir et Hippolyte May 16 '17 at 23:51
  • More seriously did you try to cut your sample to make it start with the attack? – Casimir et Hippolyte May 16 '17 at 23:53
  • Are the sample rates the same between the two tests? – Casimir et Hippolyte May 17 '17 at 00:02
  • Yes, the sample rates are the same. The durations of the samples, however, are different. By cutting the samples, do you mean discarding the part of the signal before the string has been plucked? To start analysis at `t` + 1 frame, where `t` is the real onset time of the note? – pavlos163 May 17 '17 at 00:04
  • Each sound follows an ADSR envelope (Attack Decay Sustain Release). The attack is the time during which the sound amplitude grows from zero to its maximum. I don't know how this module works, but assuming it is a dummy algorithm based on *Fourier series*, I would make things simple for it and start my test at the zero of the attack *(and possibly test more difficult situations afterwards)*. – Casimir et Hippolyte May 17 '17 at 00:10
  • Also, if you want to increase the chance of a *working* result, start with higher notes (which have fewer harmonics). For lower notes, try muting the other guitar strings (which may vibrate sympathetically). – Casimir et Hippolyte May 17 '17 at 00:19
  • Thanks! I'll look into that. – pavlos163 May 17 '17 at 00:20
  • The problem is that the pitch detection algorithm does indeed work with Fourier analysis, but the onset detection algorithm (which is the core of the problem) works by finding peaks (using some heuristics, I am not sure what exactly) in the onset strength envelope. See this: http://librosa.github.io/librosa/generated/librosa.onset.onset_detect.html – pavlos163 May 17 '17 at 22:29
  • You can try to remove the noise by using spectral subtraction before you try to determine the pitch. The noise spectrum should be estimated by taking a part of the waveform before the onset of the note and running it through an FFT. Then all you need to do is subtract the noise spectrum from the signal's spectrum (signal+noise); a sketch of this idea follows after these comments. – dsp_user May 23 '17 at 07:50
  • How can I do something "before the onset" if the onset algorithm is outputting many wrong onsets? – pavlos163 May 23 '17 at 11:45
  • Please see updated question with more detail, code for high-pass filter and one more example. – pavlos163 May 23 '17 at 14:04
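
A minimal sketch of the spectral-subtraction idea suggested in the comments, assuming you can identify a noise-only stretch of the recording (here hard-coded as the first 0.5 seconds, which is an assumption for illustration, as is the filename):

import librosa
import numpy as np

y, sr = librosa.load("noisy_guitar.wav")   # hypothetical filename
S = librosa.stft(y)                        # complex spectrogram (default hop length 512)
mag, phase = np.abs(S), np.angle(S)

# Estimate the average noise spectrum from the assumed noise-only stretch
noise_frames = int(0.5 * sr / 512)
noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

# Subtract the noise magnitude (floored at zero) and resynthesize the audio
clean_mag = np.maximum(mag - noise_mag, 0.0)
y_clean = librosa.istft(clean_mag * np.exp(1j * phase))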

2 Answers


I have three observations to share.

First, after a bit of playing around, I've concluded that the onset detection algorithm appears to have been designed to automatically rescale its own operation to take into account the local background noise at any given instant. This is likely so that it can detect onset times in pianissimo sections with the same likelihood as in fortissimo sections. It has the unfortunate result that the algorithm tends to trigger on the background noise coming from your cheap microphone--the onset detection algorithm honestly thinks it's simply listening to pianissimo music.

A second observation is that the first ~2200 samples of your recorded example (roughly the first 0.1 seconds) are a bit wonky, in the sense that the noise truly is nearly zero during that short initial interval. Try zooming way into the waveform at the starting point and you'll see what I mean. Unfortunately, the start of the guitar playing follows so quickly after the noise onset (roughly around sample 3000) that the algorithm is unable to resolve the two independently--instead it simply merges them into a single onset event that begins about 0.1 seconds too early. I therefore cut out roughly the first 2240 samples in order to "normalize" the file (I don't think this is cheating though; it's an edge effect that would likely disappear if you had simply recorded a second or so of initial silence prior to plucking the first string, as one would normally do).

My third observation is that frequency-based filtering only works if the noise and the music are actually in somewhat different frequency bands. That may be true in this case, but I don't think you've demonstrated it yet. Therefore, instead of frequency-based filtering, I elected to try a different approach: thresholding. I used the final 3 seconds of your recording, where there is no guitar playing, to estimate the median background noise level and its typical variation, in units of RMS energy, and then set a minimum energy threshold safely above that median. Only onset events returned by the detector that occur at times when the RMS energy is above the threshold are accepted as "valid".

An example script is shown below:

import librosa
import numpy as np
import matplotlib.pyplot as plt

# I played around with this but ultimately kept the default value
hoplen=512

y, sr = librosa.core.load("./Vocaroo_s07Dx8dWGAR0.mp3")
# Note that the first ~2240 samples (0.1 seconds) are anomalously low noise,
# so cut out this section from processing
start = 2240
y = y[start:]
idx = np.arange(len(y))

# Calculate the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hoplen)
onstm = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hoplen)

# Calculate RMS energy per frame.  I shortened the frame length from the
# default value in order to avoid ending up with too much smoothing
# (note: librosa.feature.rmse was renamed librosa.feature.rms in later versions)
rmse = librosa.feature.rmse(y=y, frame_length=512, hop_length=hoplen)[0,]
envtm = librosa.frames_to_time(np.arange(len(rmse)), sr=sr, hop_length=hoplen)
# Use final 3 seconds of recording in order to estimate median noise level
# and typical variation
noiseidx = envtm > envtm[-1] - 3.0
noisemedian = np.percentile(rmse[noiseidx], 50)
sigma = np.percentile(rmse[noiseidx], 84.1) - noisemedian
# Set the minimum RMS energy threshold that is needed in order to declare
# an "onset" event to be equal to 5 sigma above the median
threshold = noisemedian + 5*sigma
threshidx = rmse > threshold
# Choose the corrected onset times as only those which meet the RMS energy
# minimum threshold requirement (exact float matching is safe here because
# onstm and envtm are computed from the same frame grid)
correctedonstm = onstm[[tm in envtm[threshidx] for tm in onstm]]

# Print both in units of actual time (seconds) and sample ID number
print(correctedonstm+start/sr)
print(correctedonstm*sr+start)

fg = plt.figure(figsize=[12, 8])

# Plot the waveform together with onset times superimposed in red
ax1 = fg.add_subplot(2,1,1)
ax1.plot(idx+start, y)
for ii in correctedonstm*sr+start:
    ax1.axvline(ii, color='r')
ax1.set_ylabel('Amplitude', fontsize=16)

# Plot the RMSE together with onset times superimposed in red
ax2 = fg.add_subplot(2,1,2, sharex=ax1)
ax2.plot(envtm*sr+start, rmse)
for ii in correctedonstm*sr+start:
    ax2.axvline(ii, color='r')
# Plot threshold value superimposed as a black dotted line
ax2.axhline(threshold, linestyle=':', color='k')
ax2.set_ylabel("RMSE", fontsize=16)
ax2.set_xlabel("Sample Number", fontsize=16)

fg.show()

Printed output looks like:

In [1]: %run rosatest
[ 0.17124717  1.88952381  3.74712018  5.62793651]
[   3776.   41664.   82624.  124096.]

and the plot that it produces is shown below (the noisy waveform with the thresholded onset times superimposed in red).

stachyra
  • Thanks. However, if I want to implement an algorithm that will do that for any monophonic guitar sound, how can I proceed without knowing in advance which parts are silence? – pavlos163 May 27 '17 at 19:17
  • I would really appreciate a follow-up to this, as I really think it could be a good solution to my problem. I see how this works, but I don't see how it would work without knowing the "silent" period of the signal in advance. – pavlos163 May 28 '17 at 00:09
  • I'd suggest dividing the recording into a large number of frames (at least several hundred and maybe even a few thousand) and then calculating the RMSE of each. Then choose, say, the 1%, 2%, or 3% of frames with the lowest RMSE, and assume that most of these are silent (any which are not silent will at least be the most pianissimo). Use these frames to estimate your threshold (a sketch of this idea follows after these comments). If the assumption is false and there is < x% silence, this may result in up to x% of the quietest onsets being filtered away incorrectly as noise, but at least you get the correct result the other (100-x)% of the time. – stachyra May 29 '17 at 20:18
  • Thanks for the nice explanation. As I am new to matplotlib, I would be very interested in you adding the code snippet that produces those nice graphs! Thanks! (Yeah, I know, it's about two years later...) – headkit Mar 27 '19 at 17:35
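
A minimal sketch of the frame-percentile idea from the comment above (the 2% fraction, the 5-sigma multiplier, and the filename are illustrative assumptions, not tuned values):

import librosa
import numpy as np

y, sr = librosa.load("any_recording.wav")   # hypothetical filename
rmse = librosa.feature.rmse(y=y, frame_length=512, hop_length=512)[0,]

# Assume the quietest 2% of frames are (close to) silent and use them
# to estimate the noise floor and its spread
n_quiet = max(1, int(0.02 * len(rmse)))
quietest = np.sort(rmse)[:n_quiet]
noisemedian = np.percentile(quietest, 50)
sigma = np.percentile(quietest, 84.1) - noisemedian
threshold = noisemedian + 5 * sigma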

Did you try normalizing the sound sample before processing?
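
For example, a minimal sketch using peak normalization (whether peak or RMS normalization is intended here is an assumption, as is the filename):

import librosa

y, sr = librosa.load("guitar.wav")   # hypothetical filename
# Rescale the waveform so its peak absolute amplitude is 1.0
y = librosa.util.normalize(y)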

Reading the onset_detect documentation, we can see that there are a lot of optional arguments; have you already tried using some of them?

Maybe one of these optional arguments will help you keep only the good onset (or at least limit the size of the returned array of onset times); see the sketch below.
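
For instance, a minimal sketch passing peak-picking parameters through onset_detect (the values of delta and wait are illustrative guesses, not tuned settings; these extra keywords are forwarded to librosa.util.peak_pick):

import librosa

y, sr = librosa.load("guitar.wav")   # hypothetical filename
# delta raises the required height of a peak above the local average;
# wait enforces a minimum spacing (in frames) between accepted onsets
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, delta=0.1, wait=10)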

Please also see an updated version of your code below, which uses a pre-computed onset envelope:

import librosa
import numpy as np
from scipy import signal

def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
  y = highpass_filter(y, sr)

  o_env = librosa.onset.onset_strength(y=y, sr=sr)
  times = librosa.frames_to_time(np.arange(len(o_env)), sr=sr)  # handy for plotting

  # Pass the pre-computed envelope via onset_envelope, not y
  onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)
  pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)

  notes = []

  for onset_frame in onset_frames:
    onset = onset_frame + onset_offset
    index = magnitudes[:, onset].argmax()
    pitch = pitches[index, onset]
    if pitch != 0:
      notes.append(librosa.hz_to_note(pitch))

  return notes

def highpass_filter(y, sr):
  filter_stop_freq = 70  # Hz
  filter_pass_freq = 100  # Hz
  filter_order = 1001

  # High-pass filter
  nyquist_rate = sr / 2.
  desired = (0, 0, 1, 1)
  bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
  filter_coefs = signal.firls(filter_order, bands, desired, nyq=nyquist_rate)

  # Apply high-pass filter
  filtered_audio = signal.filtfilt(filter_coefs, [1], y)
  return filtered_audio

Does it work better?

A. STEFANI