I need to determine when someone speaks in an audio stream. I applied a Hamming window and calculated the FFT. How do I detect the human voice from here?
2 Answers
If you want to experiment with your own voice activity detection algorithms, an FFT can be used as an initial stage. Next, you might try subtracting an estimate of any stationary background noise spectrum that you have characterized. Then you could use the modified FFT results to calculate a cepstrum (or some weighted cepstral coefficients) for feature extraction. Finally, you could do some statistical pattern matching on whatever feature vectors you decided to extract, and feed the results to a decision algorithm.
Each of the above steps has likely been a research topic, and a good implementation might involve studying dozens of published research papers, which perhaps can be found in your university library.
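The stages described above can be sketched roughly as follows. This is a minimal illustration, not a production detector, and it makes several assumptions that are not in the answer: the function names `frame_signal` and `detect_voice` are hypothetical, the first few frames are assumed to contain only background noise, and the feature is plain energy in the 300 to 3400 Hz band compared against a fixed threshold, which stands in for the cepstral features and statistical pattern matching mentioned above.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def detect_voice(x, sample_rate, frame_ms=25, hop_ms=10, noise_frames=10,
                 band=(300.0, 3400.0), threshold_db=6.0):
    """Return one boolean per frame: True where speech-band energy is high.

    Assumption: the first `noise_frames` frames contain no speech, so their
    average power spectrum can serve as the stationary noise estimate.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)

    # Stage 1: Hamming window + FFT, giving a power spectrum per frame.
    window = np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

    # Stage 2: spectral subtraction of the estimated noise background.
    noise = power[:noise_frames].mean(axis=0)
    cleaned = np.maximum(power - noise, 0.0)

    # Stage 3: feature extraction (here: residual energy in the speech band).
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_energy = cleaned[:, in_band].sum(axis=1)
    noise_energy = noise[in_band].sum()

    # Stage 4: decision (here: a fixed threshold in dB over the noise floor).
    eps = 1e-12
    ratio_db = 10.0 * np.log10((band_energy + eps) / (noise_energy + eps))
    return ratio_db > threshold_db
```

Calling `detect_voice(samples, 16000)` on mono samples returns a boolean array with one entry per 10 ms hop. The threshold, band edges, and noise-estimation window are arbitrary starting values that would need tuning on real recordings, and an energy threshold like this will also fire on any loud non-speech sound in that band.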

You don't need to do an FFT for this; you need to implement a Voice Activity Detection (VAD) algorithm.

- Well, I'd like to detect the voice from the FFT. Could I do this? – user1019710 Dec 03 '11 at 21:16
- It's not clear why you would want to reinvent the wheel when there are established algorithms for VAD. Did you read the Wikipedia page I linked to? – Paul R Dec 03 '11 at 21:59
- Yes, I read it, and I haven't found anything relevant to my question. – user1019710 Dec 04 '11 at 09:28
- OK, you could try following the links in the article, e.g. to the G.729 standard. – Paul R Dec 04 '11 at 12:22