
First of all, I'd like to state that my question is not about the "classic" definition of voice recognition per se.

What we are trying to do is somewhat different, in the sense of:

  1. The user records a command.
  2. Later, when the user speaks the pre-recorded command, a certain action occurs.

For example, I record a voice command for calling my mom, so I click on her and say "Mom". Then when I use the program and say "Mom", it will automatically call her.

How would I perform the comparison of a spoken command to a saved voice sample?

EDIT: We have no need for any speech-to-text abilities, solely a comparison of sound signals. Obviously, we're looking for some sort of off-the-shelf product or framework.

Ron Rejwan
  • Like I said, how is it possible to achieve what I've asked :) – Ron Rejwan Apr 05 '11 at 17:29
  • Just to clear this issue up, we have no need for any sort of speech-to-text or anything like it; we're looking for a relatively simple framework that can compare two sound signals and see if they are "the same". This way even non-English-speaking people can use this program. – Ron Rejwan Apr 05 '11 at 19:24
  • Have you found a valid answer for this question? – VansFannel Mar 19 '12 at 14:07

5 Answers


One way this is done for music recognition is to take a time sequence of frequency spectra for the two sounds in question (a sequence of time-windowed FFTs, i.e. an STFT), map the locations of the frequency peaks along the time axis, and cross-correlate the two 2-D time-frequency peak maps for a match. This is far more robust than just cross-correlating the two raw sound samples, because the peaks change far less than all the spectral "cruft" between them. The method works best if the rate and pitch of the two utterances haven't changed too much.

In iOS 4.x, you can use the Accelerate framework for the FFTs and maybe the 2D cross correlations as well.
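A rough sketch of this peak-map approach in Python with NumPy/SciPy (rather than Accelerate); all function names, the window size, and the toy signals here are illustrative, not from any particular product:

```python
import numpy as np
from scipy.signal import stft, correlate2d

def peak_map(samples, rate, nperseg=512):
    """Binary time-frequency map marking the strongest bin in each STFT frame."""
    _, _, spec = stft(samples, fs=rate, nperseg=nperseg)
    mag = np.abs(spec)                        # magnitude spectrogram (freq x time)
    peaks = np.zeros_like(mag)
    peaks[np.argmax(mag, axis=0), np.arange(mag.shape[1])] = 1.0
    return peaks

def peak_similarity(a, b, rate):
    """Peak of the 2-D cross-correlation of the two peak maps, normalized so
    that identical, aligned maps score about 1.0."""
    pa, pb = peak_map(a, rate), peak_map(b, rate)
    width = min(pa.shape[1], pb.shape[1])     # trim to a common number of frames
    pa, pb = pa[:, :width], pb[:, :width]
    c = correlate2d(pa, pb, mode="same")
    return float(c.max() / max(np.sqrt(pa.sum() * pb.sum()), 1e-9))
```

Comparing a recording against itself scores near 1.0, while an unrelated signal scores much lower. A real system would keep several peaks per frame, prune them per frequency band, and add some tolerance for tempo variation (e.g. dynamic time warping) before thresholding.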

hotpaw2

Try using a third-party library, such as OpenEars, for iOS applications. You could have users record a voice sample and save it as transcribed text, or just let them enter the text to be recognized.

Dominic
  • I don't even need to translate said voice command into text, I simply want to store said command, and later compare it. – Ron Rejwan Apr 05 '11 at 19:18
  • No, you really need voice recognition. Comparing sounds for "equality" does not take into account any of the many ways the second recorded sample could differ from the first. Car drives by in the background? User pauses slightly longer between words? Or stutters? Be forgiving to your users - they're human, and not capable of producing the exact same sound twice. – Dominic Apr 05 '11 at 19:59

I think you'd have to perform some sort of cross-correlation to determine how similar the two signals are (assuming it's the same user speaking, of course). I'm just typing this answer out in case it helps, but I'd wait for a better answer from someone else; my signal-processing skills are close to zero.
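For reference, a bare-bones version of that idea in Python/NumPy (the function name is illustrative; as other answers note, raw-waveform correlation is fragile against timing, pitch, and background-noise differences):

```python
import numpy as np

def xcorr_similarity(a, b):
    """Peak of the normalized cross-correlation of two 1-D signals.
    Scores about 1.0 for identical signals, much less for unrelated ones."""
    a = (a - a.mean()) / (a.std() + 1e-12)    # zero-mean, unit-variance
    b = (b - b.mean()) / (b.std() + 1e-12)
    c = np.correlate(a, b, mode="full")       # correlation over all time lags
    return float(c.max() / min(len(a), len(b)))
```

Because the maximum is taken over all lags, a constant time offset between the two recordings is tolerated, but a change in speaking rate or pitch is not.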

Tejaswi Yerukalapudi
  • Cross correlation seems like what we need for the project, as we want it to be universal (and not just for English speaking customers) – Ron Rejwan Apr 05 '11 at 19:19

I'm not sure if your question is about the DSP or about how to do it on the iPhone. If it's the latter, I would start with the SpeakHere sample project that Apple provides. That way the interface for recording the voice to a file is already done for you, which will save you a lot of trouble.

Eric Brotto

I'm using ViSQOL for this purpose. The docs say it works best with a short sample, ideally 5-10 seconds. You also need to prepare the files in terms of sample rate, and they need to be .wav files. You can easily convert your files to the desired format with ffmpeg. https://github.com/google/visqol
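The ffmpeg conversion step can look like this (the filenames are placeholders; check the ViSQOL docs for the sample rate your mode needs - to my understanding, its speech mode expects 16 kHz mono WAV and its audio mode 48 kHz):

```shell
# Resample to mono (-ac 1) 16 kHz (-ar 16000) WAV for ViSQOL's speech mode.
ffmpeg -i reference.m4a -ac 1 -ar 16000 reference.wav
ffmpeg -i degraded.m4a -ac 1 -ar 16000 degraded.wav
```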

eva