Getting the amplitude (or RMS voltage) of an audio signal captured in C++ with the waveIn library?

Question

I am working on a very basic robotics project and wish to implement voice recognition in it. I know it's a complex thing, but I wish to do it for only 3 or 4 commands (or words).

I know that I can record audio using waveIn, but I wish to do real-time amplitude analysis on the audio signal. How can that be done? The wave will be input as 8-bit, mono.

I have thought of dividing the signal into sets of some specific duration, further dividing each set into smaller subsets, getting the average RMS value over each subset, summing them up, and then seeing how much they differ from the stored reference signal. If the error is below an accepted value for all (or most) of the sets, then print the word.

How can this be implemented? If you can provide any other suggestions, that would be great.

Thanks, in advance.

Pierre · Accepted Answer · 2011-04-05T02:30:18.583

There is no simple way to recognize words, because they are basically a sequence of phonemes which can vary in time and frequency.

Classical isolated-word recognition systems use the MFCCs (mel-frequency cepstral coefficients) of the signal as input data, and try to recognize patterns using HMMs (hidden Markov models) or DTW (dynamic time warping).

You will also need a silence detection module if you don't want a record button.
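A simple energy-based detector is one common way to approach this, and it also covers the RMS part of the question. The following is only a minimal sketch under assumptions: the names `frameRms` and `silentFrames`, the frame size, and the threshold are hypothetical, and the buffer is assumed to hold 8-bit unsigned mono PCM (silence at 128), which is how 8-bit WAV audio is normally stored.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// RMS amplitude of one frame of 8-bit unsigned PCM, normalised to [0, 1].
// Assumption: samples are unsigned with silence at 128, so re-centre first.
double frameRms(const std::uint8_t* samples, std::size_t count) {
    if (count == 0) return 0.0;
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        double centred = (static_cast<double>(samples[i]) - 128.0) / 128.0;
        sumSquares += centred * centred;
    }
    return std::sqrt(sumSquares / static_cast<double>(count));
}

// Splits the buffer into consecutive frames of frameSize samples and returns
// one flag per complete frame: true when the frame's RMS is below threshold.
std::vector<bool> silentFrames(const std::vector<std::uint8_t>& buffer,
                               std::size_t frameSize, double threshold) {
    std::vector<bool> flags;
    if (frameSize == 0) return flags;
    for (std::size_t start = 0; start + frameSize <= buffer.size();
         start += frameSize) {
        flags.push_back(frameRms(buffer.data() + start, frameSize) < threshold);
    }
    return flags;
}
```

With audio sampled at 8 kHz, for example, a `frameSize` of 80 gives 10 ms frames. The threshold has to be tuned by ear against the background noise of the actual microphone and room.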

For instance, Edinburgh University's toolkit provides some of these tools (with good documentation).

If you don't want to build it "from scratch", or want a source of inspiration, here is an (old but free) implementation of such a system (which uses its own toolkit) with a full explanation and practical examples of how it works.

This system is an LVCSR (Large-Vocabulary Continuous Speech Recognition) system and you only need a subset of it. If someone knows of an open-source reduced-vocabulary system (like a simple IVR), it would be welcome.

If you want to build a basic system on your own, I recommend using MFCC and DTW:

For each target word to model:
- record several instances of the word
- compute delta-MFCCs at regular intervals (e.g. every 10 ms) across the word to form a model

When you want to recognize a signal:
- compute the delta-MFCCs of this signal
- use DTW to compare these delta-MFCCs to each modeled word's delta-MFCCs
- output the word that fits best (use a threshold to reject garbage)
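
The comparison step can be sketched roughly as follows. This is an illustration, not a reference implementation: the types `Frame` and `Sequence` and the functions `frameDistance`, `dtwDistance`, and `recognize` are hypothetical names, the feature vectors are assumed to have been computed already (the MFCC extraction itself is not shown), and the plain Euclidean frame distance is one choice among several.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Frame = std::vector<double>;    // one feature vector, e.g. delta-MFCCs of one frame
using Sequence = std::vector<Frame>;  // all frames of one utterance

// Euclidean distance between two frames (assumed to have equal dimension).
double frameDistance(const Frame& a, const Frame& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Classic DTW with steps (i-1,j), (i,j-1), (i-1,j-1).
// Returns the accumulated cost of the cheapest alignment of x and y.
double dtwDistance(const Sequence& x, const Sequence& y) {
    const double inf = std::numeric_limits<double>::infinity();
    if (x.empty() || y.empty()) return inf;
    // Only two rows of the cost matrix are kept in memory.
    std::vector<double> prev(y.size() + 1, inf), curr(y.size() + 1, inf);
    prev[0] = 0.0;
    for (std::size_t i = 1; i <= x.size(); ++i) {
        curr[0] = inf;
        for (std::size_t j = 1; j <= y.size(); ++j) {
            double best = std::min({prev[j], curr[j - 1], prev[j - 1]});
            curr[j] = frameDistance(x[i - 1], y[j - 1]) + best;
        }
        std::swap(prev, curr);
    }
    return prev[y.size()];
}

// Returns the index of the closest word model, or -1 when even the best
// match costs at least rejectThreshold (the "drop garbage" step).
int recognize(const Sequence& input, const std::vector<Sequence>& models,
              double rejectThreshold) {
    int bestIndex = -1;
    double bestCost = rejectThreshold;
    for (std::size_t k = 0; k < models.size(); ++k) {
        double cost = dtwDistance(input, models[k]);
        if (cost < bestCost) {
            bestCost = cost;
            bestIndex = static_cast<int>(k);
        }
    }
    return bestIndex;
}
```

The cost here is not normalised by path length, so longer words accumulate larger costs; dividing by the combined length of the two sequences before thresholding is a common adjustment. If several recordings are kept per word, each recording can simply be stored as its own entry in `models`.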

Any thoughts on why comparing the RMS values of subsets won't do the work? There will be only one person dictating, and the vocabulary is only a few words. — TarunG, Apr 03 '11 at 07:30
The same speaker can pronounce the same word with different energy, frequency, speed and rhythm. The footprint of a word resides in the variation of frequency, not in the variation of energy. That's why you really should use MFCCs instead of RMS values. To cope with speed and rhythm, DTW is the simplest way. You just cannot recognize a word without doing such an alignment between the word and the reference. — Pierre, Apr 05 '11 at 02:17
Note: use delta-MFCCs (the derivative of the MFCCs) to get the variation of the energy for each frequency (I slightly changed my answer). Also note that @Michael cited Sphinx, which provides a C implementation (usable from C++) called PocketSphinx (I wasn't aware of it), although it's an LVCSR system (so based on phonemes and using a language model, things that you don't need). — Pierre, Apr 05 '11 at 02:36

score 1 · Answer 2 · edited May 23 '17 at 12:04

If you just want to recognize a few commands, there are many commercial and free products you can use. See "Need text to speech and speech recognition tools for Linux", "What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?" or "Speech Recognition on iPhone". The answers to these questions link to many available products and tools. Speech recognition and understanding of a list of commands is a very common problem that has been solved commercially. Many of the voice-automated phone systems you call use this type of technology. The same technology is available to developers.

From watching these questions for a few months, I've seen most developer choices break down like this:

- Windows folks: use the System.Speech features of .NET or Microsoft.Speech and install the free recognizers Microsoft provides. Windows 7 includes a full speech engine. Others are downloadable for free. There is a C++ API to the same engines known as SAPI. See http://msdn.microsoft.com/en-us/magazine/cc163663.aspx or http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx
- Linux folks: Sphinx seems to have a good following. See http://cmusphinx.sourceforge.net/ and http://cmusphinx.sourceforge.net/wiki/
- Commercial products: Nuance, Loquendo, AT&T, others
- Online services: Nuance, Yapme, others

Of course this may also be helpful - http://en.wikipedia.org/wiki/List_of_speech_recognition_software

SAPI seems good enough to do the work, but I am acquainted with the Borland C++ compiler only. Should I learn VC++ or C# if I need to use SAPI? Also, any resource that may help in transitioning from C++ to VC++ would be great. Thank you. — TarunG, Apr 07 '11 at 06:15
SAPI is just a standard COM API and is part of the Windows SDK. You should be able to program it using any C++ compiler. See http://en.wikipedia.org/wiki/Microsoft_Speech_API#SAPI_5.3 for some useful info and links. C# and the System.Speech namespace in .NET certainly make developing speech enabled applications easier, but you don't have to learn a new language to add speech to your existing application. — Michael Levy, Apr 07 '11 at 13:43
http://msdn.microsoft.com/en-us/library/bb756992.aspx should be helpful for you too. — Michael Levy, Apr 07 '11 at 13:45
