Matlab: Finding dominant frequencies in a frame of audio data

Question

I am pretty new to Matlab and I am trying to write a simple frequency-based speech detection algorithm. The end goal is to run the script on a wav file and have it output start/end times for each speech segment. If I use the code:

fr = 128;
[ audio, fs, nbits ] = wavread(audioPath);
spectrogram(audio,fr,120,fr,fs,'yaxis')

I get a useful frequency intensity vs. time graph like this:

[Image: spectrogram of the audio, with frequency on the y-axis and time on the x-axis]

By looking at it, it is very easy to see when speech occurs. I could write an algorithm to automate the detection process by looking at each x-axis frame, figuring out which frequencies are dominant (have the highest intensity), testing the dominant frequencies to see if enough of them are above a certain intensity threshold (the difference between yellow and red on the graph), and then labeling that frame as either speech or non-speech. Once the frames are labeled, it would be simple to get start/end times for each speech segment.

My problem is that I don't know how to access that data. I can use the code:

[S,F,T,P] = spectrogram(audio,fr,120,fr,fs);

to get all the features of the spectrogram, but the results of that code don't make any sense to me. The bounds of the S,F,T,P arrays and matrices don't correlate to anything I see on the graph. I've looked through the help files and the API, but I get confused when they start throwing around algorithm names and acronyms - my DSP background is pretty limited.

How could I get an array of the frequency intensity values for each frame of this spectrogram analysis? I can figure the rest out from there, I just need to know how to get the appropriate data.

hruske · Answer 1 · 2013-06-09T09:13:25.133

What you are trying to do is called speech activity detection. There are many approaches to this. The simplest might be a band-pass filter that passes the frequencies where speech is strongest, which is between 1 kHz and 8 kHz. You could then compare the total signal energy with the band-limited energy and, if the majority of the energy is in the speech band, classify the frame as speech. That's one option, but there are others too.
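The band-energy comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the test signal is synthetic (a 100 Hz hum followed by a 2 kHz tone standing in for speech), and the names `band`, `ratio` and `isSpeech` as well as the 0.5 threshold are made up for the example. The band energy is measured directly from the FFT bins of each frame rather than with an actual filter.

```matlab
% Sketch: fraction of each frame's energy inside an assumed speech band.
fs = 16000;                                  % sample rate in Hz (assumed)
t = (0:fs-1)'/fs;                            % one second of sample times
audio = [sin(2*pi*100*t(1:fs/2));            % first half: 100 Hz hum (non-speech stand-in)
         sin(2*pi*2000*t(1:fs/2))];          % second half: 2 kHz tone (speech stand-in)
band = [1000 8000];                          % speech band from the answer, in Hz
frameLen = round(0.030*fs);                  % 30 ms frames
hop = round(0.010*fs);                       % 10 ms step
nFrames = floor((numel(audio) - frameLen)/hop) + 1;
f = (0:frameLen-1)'*fs/frameLen;             % frequency of each FFT bin in Hz
half = f <= fs/2;                            % keep only non-negative frequencies
inBand = half & f >= band(1) & f <= band(2); % bins inside the speech band
ratio = zeros(nFrames, 1);
for k = 1:nFrames
    idx = (k-1)*hop + (1:frameLen);          % sample indices of frame k
    Pk = abs(fft(audio(idx))).^2;            % power per bin for this frame
    ratio(k) = sum(Pk(inBand)) / (sum(Pk(half)) + eps);
end
isSpeech = ratio > 0.5;                      % "majority of energy in the band"
```

Note that broadband noise also puts most of its energy between 1 kHz and 8 kHz, so on real recordings this ratio would probably need to be combined with an absolute energy threshold.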

To get the frequencies at the peaks, you could use an FFT to get the spectrum and then use peakdetect.m. But this is a very naïve approach, as you will get a lot of peaks that belong to the harmonics of a single fundamental frequency.

In theory you should use some sort of cepstrum (loosely, the spectrum of the log spectrum), which collapses the periodic pattern of harmonics in the spectrum into a peak at the fundamental period, and then use that with peakdetect. Or you could use existing tools that do this, such as Praat.
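As a rough illustration of the cepstrum idea, the sketch below computes a real cepstrum (inverse FFT of the log magnitude spectrum) for one synthetic 30 ms frame and picks the strongest peak in a plausible pitch range. The harmonic test signal, the 60 to 400 Hz search range and the variable names are assumptions for the example, not part of the answer, and a plain `max` is used in place of peakdetect.m.

```matlab
% Sketch: estimate the fundamental of one frame from its real cepstrum.
fs = 16000;                                   % sample rate in Hz (assumed)
frameLen = round(0.030*fs);                   % one 30 ms frame
t = (0:frameLen-1)'/fs;
f0 = 200;                                     % fundamental of the synthetic frame
x = zeros(frameLen, 1);
for h = 1:39                                  % harmonics up to just below fs/2
    x = x + sin(2*pi*h*f0*t)/h;
end
w = 0.5 - 0.5*cos(2*pi*(0:frameLen-1)'/(frameLen-1));  % Hann window, no toolbox needed
nfft = 2048;                                  % zero-padded FFT length
logMag = log(abs(fft(x.*w, nfft)) + eps);     % log magnitude spectrum
c = real(ifft(logMag));                       % real cepstrum; index n+1 holds quefrency n samples
qMin = floor(fs/400);                         % shortest period searched (400 Hz)
qMax = ceil(fs/60);                           % longest period searched (60 Hz)
[~, iPeak] = max(c(qMin+1:qMax+1));           % strongest cepstral peak in that range
periodSamples = qMin + iPeak - 1;             % quefrency of the peak, in samples
pitchHz = fs/periodSamples;                   % estimated fundamental in Hz
```

The many harmonic peaks in the spectrum show up as one dominant cepstral peak near the pitch period, which is why this is less ambiguous than picking peaks in the spectrum directly.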

Be aware that speech analysis is usually done on frames of around 30 ms, stepping by 10 ms. You could further filter out false detections by requiring that a formant be detected in N consecutive frames.
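A minimal sketch of the "N consecutive frames" filter, assuming you already have one speech/non-speech decision per 10 ms step. The `raw` vector below is invented for illustration, and `minRun` plays the role of N. The same run boundaries also give the start/end times of each segment.

```matlab
% Sketch: drop speech runs shorter than minRun frames, then get segment times.
hop = 0.010;                                   % frame step in seconds
frameDur = 0.030;                              % frame length in seconds
minRun = 3;                                    % N: minimum consecutive speech frames (assumed)
raw = [0 1 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0];     % made-up per-frame decisions
d = diff([0 raw 0]);                           % +1 where a run starts, -1 just after it ends
runStart = find(d == 1);                       % first frame index of each run
runEnd = find(d == -1) - 1;                    % last frame index of each run
keep = (runEnd - runStart + 1) >= minRun;      % runs that are long enough
runStart = runStart(keep);
runEnd = runEnd(keep);
clean = false(size(raw));                      % filtered per-frame decisions
for k = 1:numel(runStart)
    clean(runStart(k):runEnd(k)) = true;
end
startTimes = (runStart - 1)*hop;               % start of the first frame in each run
endTimes = (runEnd - 1)*hop + frameDur;        % end of the last frame in each run
```

With the example data, the isolated single frame and the two-frame run are discarded and two segments remain.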

score 1 · Answer 2 · edited Apr 13 '17 at 12:47


Why don't you use fft with fftshift:

  %% Time specifications:
   Fs = 100;                      % samples per second
   dt = 1/Fs;                     % seconds per sample
   StopTime = 1;                  % seconds
   t = (0:dt:StopTime-dt)';
   N = size(t,1);
   %% Sine wave:
   Fc = 12;                       % hertz
   x = cos(2*pi*Fc*t);
   %% Fourier Transform:
   X = fftshift(fft(x));
   %% Frequency specifications:
   dF = Fs/N;                      % hertz
   f = -Fs/2:dF:Fs/2-dF;           % hertz
   %% Plot the spectrum:
   figure;
   plot(f,abs(X)/N);
   xlabel('Frequency (in hertz)');
   title('Magnitude Response');

Why do you want to use complex stuff?

A nice and complete solution may be found at https://dsp.stackexchange.com/questions/1522/simplest-way-of-detecting-where-audio-envelopes-start-and-stop

answered Nov 27 '12 at 21:15 by 0x90 (edited by Community)

I'm confused - where does the actual audio data come into that equation? – Cbas Nov 27 '12 at 23:13
I mean, I get that I can get whatever data that equation is giving me by doing 'q = 10*log(abs(fftshift(fft(audio))));', but again, I'm not sure what data that is. It's a 335570x1 vector with a min of 0.0218 and a max of 497 - what is it supposed to be representing? – Cbas Nov 27 '12 at 23:32
you should split the buffer to smaller packets and analyze each – 0x90 Nov 28 '12 at 03:42

score 1 · Answer 3 · answered Apr 23 '13 at 13:27

Have a look at the STFT (short-time Fourier transform) or (even better) the DWT (discrete wavelet transform). Both estimate the frequency content in blocks (windows) of data, which is what you need if you want to detect sudden changes in the amplitude of certain ("speech") frequencies.

Don't use a single FFT over the whole recording, since it calculates the relative frequency content over the entire duration of the signal, making it impossible to determine when a certain frequency occurred in the signal.
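To make the block-wise idea concrete, here is a hand-rolled STFT sketch that needs no toolbox. It uses a synthetic signal in which a 1 kHz tone switches on at 0.5 s, and locates that onset from the per-frame level of the 1 kHz bin. The frame length, hop, half-of-maximum threshold and variable names are arbitrary choices for the example.

```matlab
% Sketch: manual STFT, then locate when a given frequency appears.
fs = 8000;                                    % sample rate in Hz (assumed)
t = (0:fs-1)'/fs;
x = sin(2*pi*1000*t) .* (t >= 0.5);           % 1 kHz tone that starts at 0.5 s
frameLen = 256;                               % samples per block
hop = 128;                                    % step between blocks
nFrames = floor((numel(x) - frameLen)/hop) + 1;
w = 0.5 - 0.5*cos(2*pi*(0:frameLen-1)'/frameLen);  % Hann window
nBins = frameLen/2 + 1;                       % bins from 0 Hz up to fs/2
S = zeros(nBins, nFrames);                    % rows: frequency, columns: time
for k = 1:nFrames
    seg = x((k-1)*hop + (1:frameLen)) .* w;   % windowed block k
    X = fft(seg);
    S(:, k) = X(1:nBins);                     % keep non-negative frequencies
end
F = (0:nBins-1)'*fs/frameLen;                 % frequency of each row in Hz
T = ((0:nFrames-1)*hop + frameLen/2)/fs;      % centre time of each column in s
[~, bin] = min(abs(F - 1000));                % row closest to 1 kHz
toneLevel = abs(S(bin, :));                   % level of that row over time
onsetFrame = find(toneLevel > 0.5*max(toneLevel), 1);  % first frame above half the maximum
onsetTime = T(onsetFrame);                    % approximate onset in seconds
```

A single FFT of `x` would show the 1 kHz peak but say nothing about when it starts. The columns of `S` are what make the timing recoverable, at a resolution of roughly one hop.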

score 0 · Answer 4 · answered Nov 14 '14 at 23:07


If you still want to use the built-in STFT function (spectrogram), you can plot the maximum magnitude in each frame with the following command:

plot(T,(floor(abs(max(S,[],1)))))
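For the original question of how the outputs relate to the plot, the following sketch may help. It assumes the Signal Processing Toolbox `spectrogram` function with the same arguments as in the question, but runs on a synthetic two-tone signal so it is self-contained. The names `PdB`, `peakdB` and `dominantHz` are made up for the example.

```matlab
% Sketch: how S, F, T and P line up, and the dominant frequency per frame.
fs = 8000;                                    % sample rate in Hz (assumed)
t = (0:fs-1)'/fs;
audio = sin(2*pi*500*t) + 0.5*sin(2*pi*1500*t);   % synthetic stand-in for the wav data
fr = 128;                                     % window length and FFT length, as in the question
[S, F, T, P] = spectrogram(audio, fr, 120, fr, fs);
% F: one entry per row (frequency in Hz, 0 to fs/2)
% T: one entry per column (time of each frame in seconds)
% S: complex STFT values, P: power spectral density, both numel(F)-by-numel(T)
PdB = 10*log10(P + eps);                      % intensity in dB, comparable to the plot colours
[peakdB, iMax] = max(PdB, [], 1);             % strongest row in each column (frame)
dominantHz = F(iMax);                         % dominant frequency of each frame
```

Each column of `PdB` is then the array of frequency intensities for one frame: `PdB(:, j)` holds the intensities at frequencies `F` for the frame at time `T(j)`, and thresholding those values frame by frame is the labeling step described in the question.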

answered by Sujeet
