First let me explain my goal. The goal I'm working towards is providing an input .wav file, sending it into some kind of Speech Recognition API, and returning a text file with the transcription. The application I have in mind is very simple. I do not require that it be parsed for grammar or punctuation. It can return a big, long sentence -- that's fine. I will treat each transcribed word as an observation in a text file (.tsv or .csv format)
However, the one tricky bit of data (tricky because 95% of all 3rd party audio transcription services I've reviewed don't provide this kind of data to the user) that I do need is the [0.00 - 1.00] confidence score of each word the SR takes its guess on. I would like to store that data in a new column of the text file that contains the transcribed text either in .tsv or .csv format.
That's it. That's my goal. It seems my goal is possible: here is a quote from an expert in a related post:
Convert Audio(Wav file) to Text using SAPI?
SAPI can certainly do what you want. Start with an in-proc recognizer, connect up your audio as a file stream, set dictation mode, and off you go.
and here is the relevant documentation for .wav transcription confidence scores:
https://msdn.microsoft.com/en-us/library/jj127911.aspx
Everyone makes it sound so simple, but now let me explain the problem; why I'm posting a question. The problem is that, for me, my goal is out of reach because I know next to nothing about c++ or COM. I thought that SAPI was part of the everday windows experience and had a dedicated, friendly user interface. So I grew increasingly alarmed the more I researched this procedure. However I still believe that in principle this is a very simple thing, so I'm optimistic.
I have knowledge in Python and a little JS. I know Python has code magic for other languages, so I'm sure Python can interface with SAPI this way, but since I don't know c++, I don't think that would make me any better off.
So just to reiterate, despite the skill mismatch, I'm still partial to SAPI because all the user friendly alternatives, like Dragon, Nuance, Chrome plug-ins, ect, don't provide the data granularity I need.
Now let me get to the heart of my question:
- Can someone give me their assessment on the difficulty of my "goal" as described above? Could it be done in a single .bat file? Example code would be greatly appreciated.