I am using the Google API for speech recognition.
I am using 2.5-second audio samples. Below is an example of the output in which the confidence value is omitted entirely:
{u'alternative': [{u'transcript': u'if Carol comes tomorrow have a'}, {u'transcript': u'if Carroll comes tomorrow never'}, {u'transcript': u'if Carroll comes tomorrow have a'}, {u'transcript': u'if Carole comes tomorrow have a'}, {u'transcript': u'if care comes tomorrow have a'}, {u'transcript': u'if Carroll comes tomorrow however'}, {u'transcript': u'if girl comes tomorrow have a'}, {u'transcript': u'is Carroll comes tomorrow have a'}, {u'transcript': u'if call comes tomorrow have a'}, {u'transcript': u'Carol comes tomorrow have a'}, {u'transcript': u'if kevin comes tomorrow have a'}, {u'transcript': u'if Carroll comes tomorrow have'}, {u'transcript': u'if korea comes tomorrow have a'}, {u'transcript': u'if Carroll come tomorrow have a'}, {u'transcript': u'if cry comes tomorrow have a'}], u'final': True}
The original sample is partially cut at the end, but definitely says: "if Carol comes tomorrow have a..."
In 95% of the cases, I get a confidence value only for the very first transcript; it is omitted for all the alternatives:
{u'alternative': [{u'confidence': 0.91297865, u'transcript': u'by that time perhaps something better can'}, {u'transcript': u'by that time perhaps something better came'}, {u'transcript': u'by that time perhaps something better Kim'}, {u'transcript': u'but that time perhaps something better can'}, {u'transcript': u'by that time perhaps something better come'}], u'final': True}
Here the sentence is: "By that time perhaps something better can be". So the first transcription is pretty much accurate.
Just in case, this is how I run the evaluation in Python:
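import speech_recognition as sr
from scipy.io import wavfile

r = sr.Recognizer()
# target0_path is the path to the 2.5-second WAV sample
with sr.WavFile(target0_path) as source:
    audio = r.record(source)  # read the whole file into memory
# show_all=True returns the raw API response instead of only the best transcript
result = r.recognize_google(audio, key=None, language="en-US", show_all=True)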
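Since the confidence key is sometimes missing, here is a minimal sketch of reading such a response defensively. It assumes the dict layout shown in the outputs above, and that an empty result may come back as something other than a dict; `extract_alternatives` is a hypothetical helper name, not part of speech_recognition.

```python
def extract_alternatives(result):
    """Return (transcript, confidence) pairs from a show_all-style response.

    confidence is None whenever the response omits it for an alternative.
    """
    # Assumption: a response without any hypotheses is not a dict (e.g. an empty list).
    if not isinstance(result, dict):
        return []
    return [
        (alt.get("transcript", ""), alt.get("confidence"))
        for alt in result.get("alternative", [])
    ]


# Shortened copy of the second output shown above.
sample = {
    "alternative": [
        {"confidence": 0.91297865,
         "transcript": "by that time perhaps something better can"},
        {"transcript": "by that time perhaps something better came"},
    ],
    "final": True,
}

pairs = extract_alternatives(sample)
for transcript, confidence in pairs:
    print(transcript, confidence)
```

This only avoids a KeyError when the key is absent; it does not make the API return a confidence value.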
Do you have any ideas or advice? Are there any particular settings I could use to make the API return the confidence values?