I would like to use offline speech-to-text recognition, mostly for German.
In particular, I want to use Mozilla DeepSpeech (a TensorFlow implementation of Baidu's DeepSpeech architecture), but I fear that the quality of my audio input is not good enough to achieve a low word error rate (WER).
(English) example:
The speaker said "know", but the engine might have heard "flow", "show", "go", or "know".
I would like to get [flow, show, go, know]
back from the engine, so that I can manually decide afterwards which suggestion fits best. How can I get this?
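To make the desired output concrete, here is a minimal sketch in Python of what I have in mind. It is not DeepSpeech's API: the `n_best` helper, the `hypotheses` list, and the scores are all made up for illustration, and I am assuming the engine could expose several scored candidates for the same stretch of audio.

```python
from typing import List, Tuple


def n_best(candidates: List[Tuple[str, float]], n: int = 4) -> List[str]:
    """Return the n highest-scoring candidate words, best first.

    `candidates` is a hypothetical list of (word, score) pairs; a real
    engine may expose its alternatives in a different shape.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [word for word, _score in ranked[:n]]


# Made-up scores for the spoken word "know" (illustration only).
hypotheses = [
    ("show", 0.27),
    ("flow", 0.41),
    ("no", 0.05),
    ("know", 0.09),
    ("go", 0.18),
]

print(n_best(hypotheses))  # ['flow', 'show', 'go', 'know']
```

The point is that the correct word may not be ranked first, so I need the whole list rather than only the top result.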
Do other speech-to-text engines offer this possibility?