Slicing audio signal to detect pitch

Question

I am using Librosa to transcribe monophonic guitar audio signals.

I thought it would be a good start to "slice" the signal at the onset times, so that note changes are detected at the correct time.

Librosa provides a function that detects the local minima before the onset times. I checked those timings and they are correct.

Here is the waveform of the original signal and the times of the minima.

[ 266240  552960  840704 1161728 1427968 1735680 1994752]

The melody played is E4, F4, F#4 ..., B4.

Therefore the results should ideally be: 330Hz, 350Hz, ..., 493Hz (approximately).

As you can see, the times in the minima array represent the moment just before each note was played.

However, on a sliced signal (of 10-12 seconds with only one note per slice), my frequency detection methods give really poor results. I am confused because I can't see any bugs in my code:

  import librosa
  import numpy as np

  # freq_from_hps, freq_from_autocorr and freq_from_fft are defined elsewhere (see the link below).
  y, sr = librosa.load(filename, sr=40000)

  onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
  oenv = librosa.onset.onset_strength(y=y, sr=sr)

  # Move each onset back to the preceding local minimum of the onset envelope.
  onset_bt = librosa.onset.onset_backtrack(onset_frames, oenv)

  # Converting those times from frames to samples.
  new_onset_bt = librosa.frames_to_samples(onset_bt)

  # Split the signal at every backtracked onset except the first.
  slices = np.split(y, new_onset_bt[1:])
  for s in slices:
    print(freq_from_hps(s, sr))
    print(freq_from_autocorr(s, sr))
    print(freq_from_fft(s, sr))

Where the freq_from functions are taken directly from here.

I would assume this is just bad precision from the methods, but I get some crazy results. Specifically, freq_from_hps returns:

1.33818658287
1.2078047577
0.802142642257
0.531096911977
0.987532329094
0.559638134414
0.953497587952
0.628980979055

These values are supposed to be the 8 pitches of the 8 corresponding slices (in Hz!).

freq_from_fft returns similar values whereas freq_from_autocorr returns some more "normal" values but also some random values near 10000Hz:

242.748000585
10650.0394232
275.25299319
145.552578747
154.725859019
7828.70876515
174.180627765
183.731497068

This is the spectrogram from the whole signal:

And this is, for example, the spectrogram of slice 1 (the E4 note):

As you can see, the slicing has been done correctly. However, there are several issues. First, there is an octave issue in the spectrogram, which I was expecting. But the results I get from the 3 methods mentioned above are just very weird.

Is this an issue with my signal processing understanding or my code?

score 3 · Accepted Answer · edited May 23 '17 at 12:34

Is this an issue with my signal processing understanding or my code?

Your code looks fine to me.

The frequencies you want to detect are the fundamental frequencies of your pitches (the problem is also known as "f0 estimation").

So before using something like freq_from_fft I'd bandpass filter the signal to get rid of garbage transients and low frequency noise, i.e. the stuff that's in the signal but irrelevant to your problem.

Think about which range your fundamental frequencies are going to be in. For an acoustic guitar that's E2 (82 Hz) to F6 (1,397 Hz). That means you can get rid of anything below ~80 Hz and above ~1,400 Hz (for a bandpass example, see here). After filtering, do your peak detection to find the pitches (assuming the fundamental actually has the most energy).
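A minimal sketch of that filter-then-peak-pick idea, assuming SciPy is available. The helper names bandpass and strongest_frequency, the filter order, and the synthetic test signal are illustrative choices of mine, not part of Librosa or of the freq_from_* functions in the question:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(y, sr, low_hz=80.0, high_hz=1400.0, order=4):
    """Zero-phase Butterworth band-pass (hypothetical helper).

    The default cutoffs follow the guitar range discussed above.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, y)


def strongest_frequency(y, sr):
    """Frequency in Hz of the largest bin in the Hann-windowed magnitude spectrum."""
    windowed = y * np.hanning(len(y))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]


# Synthetic stand-in for one slice: a 330 Hz tone under louder 20 Hz rumble.
sr = 40000
t = np.arange(sr) / sr  # one second of audio
tone = np.sin(2 * np.pi * 330.0 * t)
rumble = 3.0 * np.sin(2 * np.pi * 20.0 * t)
noisy = tone + rumble

raw_peak = strongest_frequency(noisy, sr)
filtered_peak = strongest_frequency(bandpass(noisy, sr), sr)
print(raw_peak, filtered_peak)
```

On this synthetic signal the unfiltered peak lands on the rumble, while the filtered peak lands on the tone. A real guitar slice is messier, and the fundamental is not guaranteed to be the strongest partial, so treat this only as a starting point.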

Another strategy might be to ignore the first X samples of each slice, as they tend to be percussive rather than harmonic and won't give you much pitch information anyway. So, for each slice, just look at the last ~90% of its samples.
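A rough sketch of that trimming step. The function name drop_attack and the skip_fraction parameter are hypothetical, and the 10% figure is only the rule of thumb from the paragraph above, not a tuned value:

```python
import numpy as np


def drop_attack(slice_samples, skip_fraction=0.1):
    """Return the slice without its first `skip_fraction` of samples (hypothetical helper)."""
    start = int(len(slice_samples) * skip_fraction)
    return slice_samples[start:]


# Synthetic slice: a noisy "pluck" followed by a steady 330 Hz tone.
sr = 40000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr  # one second of audio
note = np.sin(2 * np.pi * 330.0 * t)
attack_len = sr // 20  # pretend the first 50 ms are percussive
note[:attack_len] += 5.0 * rng.standard_normal(attack_len)

# Only the sustained, harmonic part is passed on to pitch detection.
sustained = drop_attack(note)
print(len(note), len(sustained))
```

In the question's loop this would mean passing drop_attack(s) instead of s to each freq_from_* function.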

That said, there is a large body of work on f0 (fundamental frequency) estimation. ISMIR papers are a good starting point.

Last, but not least, Librosa's piptrack function may do just what you want.
