
I'm trying to get the pitch class from recorded voice (44.1 kHz) using autocorrelation. What I'm doing is basically described here: http://cnx.org/content/m11714/latest/ and also implemented here: http://code.google.com/p/yaalp/source/browse/trunk/csaudio/WaveAudio/WaveAudio/PitchDetection.cs (the part using PitchDetectAlgorithm.Amdf)

So, to detect the pitch class, I build an array with the normalized correlation for the frequencies of C2 to B3 (two octaves) and select the one with the highest value (applying a "1 - correlation" transformation first, so I search for a maximum instead of a minimum).
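For concreteness, this is a minimal sketch of that procedure in Python rather than C#; the AMDF distance stands in for the linked implementation, and `tone_freq`, the peak normalization, and the 4000-sample test buffer are my own assumptions, not the exact code in question:

```python
import math

FS = 44100  # sample rate (Hz), as in the question

def tone_freq(halftones_above_c2):
    """Equal-tempered frequency, with C2 = 65.4064 Hz."""
    return 65.4064 * 2 ** (halftones_above_c2 / 12.0)

def amdf(buf, lag):
    """Average magnitude difference for a given lag (period in samples)."""
    n = len(buf) - lag  # number of samples actually compared
    return sum(abs(buf[i] - buf[i + lag]) for i in range(n)) / n

def detect_tone(buf):
    """Return the halftone index (0 = C2) whose period matches best."""
    peak = max(abs(s) for s in buf)
    best_tone, best_weight = 0, float("-inf")
    for tone in range(24):  # C2 .. B3, as in the question
        lag = int(round(FS / tone_freq(tone)))
        weight = 1.0 - amdf(buf, lag) / peak  # "1 - distance", higher = better
        if weight > best_weight:
            best_tone, best_weight = tone, weight
    return best_tone

# a 4000-sample A2 (110 Hz) test sine, detected correctly
buf = [math.sin(2 * math.pi * i / FS * 110.0) for i in range(4000)]
```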

I tested it with generated audio (a simple sine wave):

data[i] = (short)(Math.Sin(2 * Math.PI * i/fs * freq) * short.MaxValue);

But it only works for input frequencies lower than B4. Investigating the generated array, I found that, starting from G3, another peak evolves that eventually gets bigger than the correct one, and my B4 is detected as an E. Changing the number of analysed frequencies did not help at all.

My buffer size is 4000 samples and the frequency of B4 is ~493 Hz, so I cannot think of a reason why this is failing. Are there any more constraints on the frequencies or buffer sizes? What is going wrong here?

I'm aware that I could use an FFT like Performous does, but this method looked simple and also gives weighted frequencies that can be used for visualisations. I don't want to throw it away that easily, and I'd at least like to understand why it fails.

Update: Core function used:

private double _GetAmdf(int tone)
{
    int samplesPerPeriod = _SamplesPerPeriodPerTone[tone]; // samples in one period
    int accumDist = 0; // accumulated distances
    int sampleIndex = 0; // index of sample to analyze
    // Start value = index of the sample one period ahead
    for (int correlatingSampleIndex = sampleIndex + samplesPerPeriod; correlatingSampleIndex < _AnalysisBufLen; correlatingSampleIndex++, sampleIndex++)
    {
        // distance to the corresponding sample in the next period (0 = equal .. Int16.MaxValue * 2 = totally different)
        int dist = Math.Abs(_AnalysisBuffer[sampleIndex] - _AnalysisBuffer[correlatingSampleIndex]);
        accumDist += dist;
    }

    // normalize by the number of samples actually compared (correlation: 1 - dist / Int16.MaxValue)
    return 1.0 - (double)accumDist / Int16.MaxValue / sampleIndex;
}

With that function, the pitch/tone is (pseudocode)

tone = Max(_GetAmdf(tone)) <- for tone = C2..

I also tried using actual autocorrelation with:

double accumDist=0;
//...
double dist = _AnalysisBuffer[sampleIndex] * _AnalysisBuffer[correlatingSampleIndex];
//...
const double scaleValue = (double)Int16.MaxValue * (double)Int16.MaxValue;
return accumDist / (scaleValue * sampleIndex);

but that fails too, detecting an A3 as a D in addition to B4 as an E.
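For reference, the float version of that normalization can be sketched like this (Python with unit-amplitude samples instead of Int16, so the `scaleValue` factor drops out; the variable names are mine). For a sine, a lag of one full period should score near +0.5 and a half-period lag near -0.5:

```python
import math

FS = 44100
buf = [math.sin(2 * math.pi * i / FS * 110.0) for i in range(4000)]

def autocorr(buf, lag):
    """Normalized autocorrelation: mean of sample * sample-one-lag-ahead."""
    n = len(buf) - lag  # number of samples actually compared
    acc = sum(buf[i] * buf[i + lag] for i in range(n))
    return acc / n  # for Int16 data, also divide by Int16.MaxValue squared

# 110 Hz has a period of ~401 samples at 44.1 kHz; 200 is roughly a half period
```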

Note: I do not divide by the buffer length but by the number of samples actually compared. I'm not sure whether this is right, but it seems logical.
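That choice matters: with a fixed divisor, the distance at a large lag is computed from far fewer terms and therefore looks artificially small, biasing the search toward long periods. A small sketch of the effect (Python, unit-amplitude sine; `accum` mirrors the accumulated distance, and the chosen lag is an arbitrary non-matching one):

```python
import math

FS = 44100
buf = [math.sin(2 * math.pi * i / FS * 300.0) for i in range(4000)]

def accum(buf, lag):
    """Accumulated absolute difference, as in _GetAmdf."""
    return sum(abs(buf[i] - buf[i + lag]) for i in range(len(buf) - lag))

large = 3900                                          # only 100 samples overlap
per_overlap = accum(buf, large) / (len(buf) - large)  # divide by compared samples
per_buffer = accum(buf, large) / len(buf)             # divide by buffer length
# per_buffer is 40x smaller -- the non-matching long lag looks like a great match
```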

Flamefire

3 Answers


This is the common octave problem with autocorrelation and similar lag-based pitch estimators (AMDF, ASDF, etc.).

A frequency that is one octave (or any other integer submultiple) lower will give just as good a match in shifted-waveform similarity: a sine wave shifted by 2pi looks the same as one shifted by 4pi, and a lag of two periods corresponds to a pitch one octave lower. Depending on noise and on how close the continuous peak is to the nearest sampled lag, one estimation peak or the other may come out slightly higher, with no change in the actual pitch.

So some other test needs to be used to remove the lower-octave (or other submultiple-frequency) peaks in the waveform correlation or lag matching, e.g. checking whether a peak looks close enough to one or more other peaks one or more octaves (or other frequency multiples) up.
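One common form of that extra test (my own sketch, not code from this answer) is to keep every lag whose score is within a small tolerance of the global best and then take the shortest one, so that integer multiples of the true period cannot win on noise alone:

```python
def pick_lag(score_by_lag, tolerance=0.05):
    """score_by_lag: dict mapping lag (samples) -> normalized match, 1.0 = perfect.
    Return the smallest lag that scores nearly as well as the best one."""
    best = max(score_by_lag.values())
    candidates = [lag for lag, score in score_by_lag.items()
                  if score >= best - tolerance]
    return min(candidates)

# noisy scores: the 2x and 3x multiples of the true 89-sample period score
# about as well as the true lag, and 2x even slightly better
scores = {89: 0.97, 178: 0.98, 267: 0.96}
```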

hotpaw2
  • I do not care which octave the result is in, it only has to be the right pitch class. So if I put in a B4 sine I'd be happy with a B2 output, as long as it is a B. Or did I misunderstand you? – Flamefire Mar 22 '14 at 19:37
  • The 3rd submultiple or 3rd multiple of a B pitch is not at the frequency of a B pitch. Same for any non-power-of-2. – hotpaw2 Mar 22 '14 at 20:13
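That is exactly what the question reports. A quick equal-temperament check (Python; A4 = 440 Hz as the reference, helper names are mine) shows that the 3rd submultiple of B4 lands on an E and the 3rd submultiple of A3 lands on a D, matching both observed errors:

```python
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq):
    """Name of the nearest equal-tempered pitch class."""
    halftones_from_a4 = round(12 * math.log2(freq / 440.0))
    return NAMES[(halftones_from_a4 + 9) % 12]  # +9 maps A4 onto index 9 (A)

b4, a3 = 493.88, 220.0
# b4 / 2 is still a B, but b4 / 3 is an E, and a3 / 3 is a D
```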

I don't know C#, but if the tiny amount of code you've supplied is correct and behaves like most other C-like languages, it is introducing a huge amount of distortion.

In most C-like languages (and most other languages I know, like Java), the output of something like Math.sin() is in the range [-1, 1]. Casting that to an int, short or long truncates toward zero, so nearly every sample collapses to 0 (it only hits -1 or 1 when the sine lands exactly on them). Essentially, you will have destroyed your sine wave, and whatever these libraries are picking up is not the signal you intended to test with.
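A quick demonstration of the truncation (Python, where `int()` also truncates toward zero like a C-style cast; the 440 Hz/1000-sample choices are arbitrary): without the scale factor there is essentially no waveform left to correlate, with it the samples span nearly the full short range.

```python
import math

fs, freq = 44100.0, 440.0
raw = [math.sin(2 * math.pi * i / fs * freq) for i in range(1000)]

truncated = [int(s) for s in raw]        # cast without scaling
scaled = [int(32767 * s) for s in raw]   # scale to the short range first
```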

Try this:

data[i] = (short)(32767 * Math.Sin(2 * Math.PI * i / fs * freq));
Bjorn Roche
  • right, I just noticed I omitted an important detail by removing unnecessary code. I actually use what you suggested – Flamefire Mar 22 '14 at 12:54
  • You should update your question with a more complete, but still short, example. You may also want to add the c# tag and remove the voice tag. – Bjorn Roche Mar 22 '14 at 16:05
  • Done. It's basically the same as in the linked code. Not going to change the tags, as the question is related to voice(-pitch-detection), which influences the algorithm used, and not only to C#. I would happily accept answers in other languages as well. – Flamefire Mar 22 '14 at 17:59

Besides everything said by @Bjorn and @hotpaw2, I have also run into the problems described by @hotpaw2 in the past.

It was not clear from your code whether you are computing the difference with a shift of one sample at a time (as I have always seen in the equations for the AMDF)!

I did it in Java; you can find the full source code in Tarsos!

Here are the equivalent steps from your post, in Java:

int maxShift = audioBuffer.length;
double[] amd = new double[maxShift];
double[] frames1, frames2, calcSub;
int t;

for (int i = 0; i < maxShift; i++) {
    // copy the unshifted frame and the frame shifted by i samples
    frames1 = new double[maxShift - i];
    frames2 = new double[maxShift - i];
    t = 0;
    for (int aux1 = 0; aux1 < maxShift - i; aux1++) {
        frames1[t] = audioBuffer[aux1];
        t = t + 1;
    }
    t = 0;
    for (int aux2 = i; aux2 < maxShift; aux2++) {
        frames2[t] = audioBuffer[aux2];
        t = t + 1;
    }

    // element-wise difference of the two frames
    int frameLength = frames1.length;
    calcSub = new double[frameLength];
    for (int u = 0; u < frameLength; u++) {
        calcSub[u] = frames1[u] - frames2[u];
    }

    // sum of absolute differences for this lag
    double summation = 0;
    for (int l = 0; l < frameLength; l++) {
        summation += Math.abs(calcSub[l]);
    }
    amd[i] = summation;
}
ederwander
  • How is your code different than mine? It is just a more complicated version of mine: for (int i = 0; i < maxShift; i++) { double summation = 0; for (int l = 0; l < maxshift-i; l++) { summation += Math.abs(audioBuffer[l] - frames2[i]); } amd[i] = summation; } – Flamefire Mar 22 '14 at 20:32
  • I'm not an expert in C#, but from your source I can't tell whether your samplesPerPeriod ever equals 1 (i.e. whether you compute the difference with a one-sample shift). The only real difference between these codes is how the minimum position is treated, to make sure there are no octave bugs – ederwander Mar 22 '14 at 20:41
  • samplesPerPeriod is never 1. I only compare tones (from C2 to Cx), so samplesPerPeriod is between, say, 75 and 674 – Flamefire Mar 22 '14 at 21:02
  • Maybe you are very close to the border for finding the correct frequency: 75 tells me that in theory you cannot find frequencies above 588 Hz, and B4 may be near that border. You could try starting from 42 instead of 75 and see if it can now track the pitch for B4.... – ederwander Mar 22 '14 at 21:29
  • That does not help. However, if I don't round when computing samplesPerPeriod from the frequency, I get better results in a couple of tests – Flamefire Mar 22 '14 at 22:41
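The rounding error this last comment hints at can be quantified (Python sketch; 493.88 Hz for B4 and 65.41 Hz for C2 are equal-tempered values, the helper names are mine): rounding the period to whole samples shifts the representable frequency by several cents at B4 but well under a cent at C2, so the quantization error grows with pitch.

```python
import math

FS = 44100

def quantized_freq(freq):
    """Frequency actually tested after rounding the period to whole samples."""
    samples_per_period = int(round(FS / freq))
    return FS / samples_per_period

def cents(f_ref, f):
    """Signed interval from f_ref to f in cents (100 cents = 1 halftone)."""
    return 1200 * math.log2(f / f_ref)

b4, c2 = 493.88, 65.41
err_b4 = cents(b4, quantized_freq(b4))  # ~ +5.7 cents (89 samples vs. 89.29)
err_c2 = cents(c2, quantized_freq(c2))  # well under 1 cent (674 vs. 674.21)
```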