Audio File to Text Using SAPI or Equally Capable SR

Question

First let me explain my goal. The goal I'm working towards is providing an input .wav file, sending it into some kind of Speech Recognition API, and returning a text file with the transcription. The application I have in mind is very simple. I do not require that it be parsed for grammar or punctuation. It can return a big, long sentence -- that's fine. I will treat each transcribed word as an observation in a text file (.tsv or .csv format)

However, the one tricky bit of data (tricky because 95% of all 3rd party audio transcription services I've reviewed don't provide this kind of data to the user) that I do need is the [0.00 - 1.00] confidence score of each word the SR takes its guess on. I would like to store that data in a new column of the text file that contains the transcribed text either in .tsv or .csv format.

That's it. That's my goal. It seems my goal is possible: here is a quote from an expert in a related post:

Convert Audio(Wav file) to Text using SAPI?

SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.

and here is the relevant documentation for .wav transcription confidence scores:

https://msdn.microsoft.com/en-us/library/jj127911.aspx

https://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.recognizedwordunit.confidence(v=office.14).aspx

Everyone makes it sound so simple, but now let me explain the problem; why I'm posting a question. The problem is that, for me, my goal is out of reach because I know next to nothing about c++ or COM. I thought that SAPI was part of the everday windows experience and had a dedicated, friendly user interface. So I grew increasingly alarmed the more I researched this procedure. However I still believe that in principle this is a very simple thing, so I'm optimistic.

I have knowledge in Python and a little JS. I know Python has code magic for other languages, so I'm sure Python can interface with SAPI this way, but since I don't know c++, I don't think that would make me any better off.

So just to reiterate, despite the skill mismatch, I'm still partial to SAPI because all the user friendly alternatives, like Dragon, Nuance, Chrome plug-ins, ect, don't provide the data granularity I need.

Now let me get to the heart of my question:

Can someone give me their assessment on the difficulty of my "goal" as described above? Could it be done in a single .bat file? Example code would be greatly appreciated.

It's a fairly opened ended question, I'd be happy to have any bit of input, no matter how big or small. Just bear in mind I'm not the most adept programmer in the world. So I might not grasp all these domain specific terms, but I'm doing my best. — Arash Howaida, Oct 09 '16 at 14:49

score 5 · Accepted Answer · answered Oct 12 '16 at 15:16

It probably goes without saying, but I think you're going to find it difficult to work with SAPI's C interface if you don't have a strong handle on C as a language. I wrote a program that does almost exactly what you're talking about some time ago to test the concept. First a code dump:

#include "dirent.h"
#include <iostream>
#include <string>
#include <sapi.h>
#include <sphelper.h>

int main(int argc, char* argv[]){

    DIR *dir;
    struct dirent* entry;
    struct stat* statbuf;
    ::CoInitialize(NULL);
    if((dir = opendir(".")) != NULL){
        while((entry = readdir(dir)) != NULL){
            char extCheck[260];
            strcpy(extCheck, entry->d_name);
            if(strlen(extCheck) > 4 && !strcmp(strlwr(extCheck) + strlen(extCheck)-4, ".wav")){
                //printf("%s\n",entry->d_name);
                //1. Find the wav files
                //2. Check the wavs to make sure they're the correct format
                //3. Output any errors to the error log
                //4. Produce the text files for the wavs
                //5. Cleanup and exit
                FILE* fp;
                std::string fileName = std::string(entry->d_name,entry->d_name + strlen(entry->d_name)-4);
                fileName += ".txt";
                fp = fopen(fileName.c_str(), "w+");
                HRESULT hr = S_OK;
                CComPtr<ISpStream> cpInputStream;
                CComPtr<ISpRecognizer> cpRecognizer;
                CComPtr<ISpRecoContext> cpRecoContext;
                CComPtr<ISpRecoGrammar> cpRecoGrammar;
                CSpStreamFormat sInputFormat;
                hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
                hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
                hr = sInputFormat.AssignFormat(SPSF_16kHz16BitStereo);
                std::string sInputFileName = entry->d_name;
                std::wstring wInputFileName = std::wstring(sInputFileName.begin(), sInputFileName.end());
                hr = cpInputStream->BindToFile(wInputFileName.c_str(), SPFM_OPEN_READONLY, &sInputFormat.FormatId(), sInputFormat.WaveFormatExPtr(), SPFEI_ALL_EVENTS);
                hr = cpRecognizer->SetInput(cpInputStream, TRUE);
                hr = cpRecognizer->CreateRecoContext(&cpRecoContext);
                hr = cpRecoContext->CreateGrammar(NULL, &cpRecoGrammar);
                hr = cpRecoGrammar->LoadDictation(NULL,SPLO_STATIC);

                hr = cpRecoContext->SetNotifyWin32Event();
                auto hEvent = cpRecoContext->GetNotifyEventHandle();
                hr = cpRecoContext->SetInterest(SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM), SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM));
                hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
                BOOL fEndStreamReached = FALSE;
                unsigned int timeOut = 0;
                //WaitForSingleObject(hEvent, INFINITE);
                while (!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(INFINITE)){
                    CSpEvent spEvent;

                     while (!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext)){

                        switch (spEvent.eEventId){

                            case SPEI_RECOGNITION:
                                {
                                    auto pPhrase = spEvent.RecoResult();
                                    SPPHRASE *phrase = nullptr;// new SPPHRASE();
                                    LPWSTR* text = new LPWSTR(L"");
                                    pPhrase->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, text, NULL);
                                    pPhrase->GetPhrase(&phrase);

                                    if(phrase != NULL && phrase->pElements != NULL) {
                                        std::wstring wRuleName = L"";

                                        if(nullptr != phrase && phrase->Rule.pszName != NULL) {
                                            wRuleName = phrase->Rule.pszName;
                                        }

                                        std::wstring recognizedText = L"";
                                        bool firstWord = true;
                                        for(ULONG i = 0; i < (ULONG)phrase->Rule.ulCountOfElements; ++i) {

                                            if(phrase->pElements[i].pszDisplayText != NULL) {

                                                std::wstring outString = phrase->pElements[i].pszDisplayText;
                                                std::string soutString = std::string(outString.begin(), outString.end());
                                                if(!firstWord){
                                                    soutString = " " + soutString;
                                                    firstWord = false;
                                                }
                                                soutString = soutString + " ";
                                                fputs(soutString.c_str(),fp);
                                                /*if(recognizedText != L"") {
                                                    recognizedText += L" " + outString;
                                                } else {
                                                    recognizedText += outString;
                                                }*/
                                            }
                                        }

                                    }
                                    delete[] text;

                                    break;
                                }

                            case SPEI_END_SR_STREAM:
                                {
                                    fEndStreamReached = TRUE;
                                    break;
                                }

                        }

                        // clear any event data/object references
                        spEvent.Clear();
                    }
                }

                hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
                hr = cpRecoGrammar->UnloadDictation();
                hr = cpInputStream->Close();

                fclose(fp);
            }
        }
        closedir(dir);
    } else {
        perror("Error opening directory");
    }

    ::CoUninitialize();

    std::printf("Press any key to continue...");
    std::getchar();
    return 0;
}

I haven't run this in a long time, but you'll have to get dirent.h for it work. I was playing around with that library for no other reason than to try it out.

With the code provided you could probably start looking at what confidence values are getting generated at the recognition step. You could also tweak this to run from a batch file if you wanted to.

The problems that I faced were the following:

Accuracy was a problem, and in order to improve it I would have to train the recognizer, which was going to require a lot more time than I had.
I found that direct translation into text wasn't what I really wanted after all. As it turns out the phoneme data is quite a bit more important. With that you can form your own confidence scheme, and develop your own alternatives specific to your application.
Window's Recognizer, while good, isn't going to recognize words that it doesn't know about. You'll have to figure out how to add your vocabulary to windows' speech recognizer lexicon.

With that said, it's not a trivial undertaking to use the stock windows desktop speech recognizer. I'd take a look at some existing APIs out there. If you're not limited to client side only applications, you'd do well to look into other APIs.

All your points are well taken. thank you for demonstrating what SAPI is capable of, as well as its limitations. This is exactly what I needed to assess the future of my SR project. — Arash Howaida, Oct 13 '16 at 19:53
Wow, I just checked this C/C++ code under VS2019 and it works as is ! (with a few minor changes). Accuracy seems quite good, I tested a WAV file created with SAPI in TTS mode with "Microsoft David" voice. Question, what changes should be done in the above code to listen live dictation ? Also, it uses the first SR it founds. How to select another SR ? (I have 3 languages installed). Thanks ! — Carl, Jul 08 '20 at 15:50
it works great for WAV files. But if I use an MP3 file, the line `hr = cpInputStream->BindToFile(wInputFileName.c_str(), SPFM_OPEN_READONLY, &sInputFormat.FormatId(), sInputFormat.WaveFormatExPtr(), SPFEI_ALL_EVENTS);` returns `The Parameter is incorrect`. The question is, can I use MP3 files as input for SAPI? and if yes, how do I determine the correct format for the call to `hr = sInputFormat.AssignFormat(SPSF_16kHz16BitStereo)` because `SPSF_16kHz16BitStereo` will certainly not be correct and I don't think we should hardcode it. — Sam, Mar 01 '22 at 17:58

score 2 · Answer 2 · answered Oct 10 '16 at 17:42

Quite honestly, this is fairly difficult given the approach you're describing in your question. The existing SAPI engines either don't do dictation at all (e.g., the "server" engine available via Microsoft.Speech.Recognition) or require training to learn the particulars of a given voice (e.g., the "desktop" engine available via System.Speech.Recognition).

The Windows Runtime recognizer (Windows.Media.SpeechRecognition) supports dictation and provides a confidence value, but does not support recognition from a stream.

With the approach you're describing, I would use the Bing Speech API, as it provides the confidence values you want via the REST API.

Audio File to Text Using SAPI or Equally Capable SR

2 Answers2

Linked