I have a large collection of audio files with their transcripts in a foreign language.
I want to be able to recognize whether the user recites the right words from the text.
How do I start approaching this using CMU Sphinx? Do I need a language model, acoustic model?
I would like some guidance, please, on where to start.

amitairos
1 Answer
How do I start approaching this using CMU Sphinx?
You recognize the audio and compare the result to the transcription. In case of mismatches, you can warn your user.
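The compare step could look roughly like this. A minimal sketch in Python, assuming you already have the recognizer's hypothesis as a string (the decoding itself would be done with PocketSphinx or another engine); the `find_mismatches` helper is made up for the example and does a word-level diff with the standard library's `difflib`:

```python
import difflib

def find_mismatches(reference: str, hypothesis: str):
    """Compare a reference transcript with a recognizer hypothesis,
    word by word, and return the spans that differ."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
    mismatches = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # op is "replace", "delete", or "insert"
            mismatches.append((op, ref_words[i1:i2], hyp_words[j1:j2]))
    return mismatches

# Example: the reciter substituted one word.
errors = find_mismatches("in the beginning god created",
                         "in the beginning god devised")
```

An empty result means the recitation matched; otherwise each tuple tells you which reference words were replaced, skipped, or inserted, which is enough to warn the user.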
Do I need a language model, acoustic model?
Yes, you need both. You can build them from your collection, but you still need bootstrapped data to start from. To get more specific advice, it would help to mention the language.
I would like some guidance please and where to start from.
Start with the tutorial: https://cmusphinx.github.io/wiki/tutorial

Nikolay Shmyrev
- Thanks. The language is Hebrew. Could you please point me to a more specific tutorial? I got lost in all of them. Could you please also give me the steps I need to follow? – amitairos May 15 '17 at 08:19
- Ok, Hebrew is not supported yet; you have to build the model or use a commercial one. – Nikolay Shmyrev May 15 '17 at 08:24
- Ok. 1. Is there a commercial one ready? Where? 2. Isn't it simpler because I need only the words in my audio and transcripts? If so, what specific approach should I take? – amitairos May 15 '17 at 08:25
- Commercial models are available; you can contact me for the reference. Acoustic model training is usually several months of work due to the data collection requirements, but you can go this way too if you have time. You can also consider using the Google Speech API, which supports Hebrew. – Nikolay Shmyrev May 15 '17 at 08:43
- I see. I would like to use the Google Speech API. The problem is that it's not that accurate. Is it possible to train it, or to make it more accurate by giving it a list of possible words? – amitairos May 15 '17 at 08:45
- The Google API supports word hints via SpeechContext: https://cloud.google.com/speech/reference/rest/v1/RecognitionConfig. Overall, it is hard to guess what is wrong with your accuracy; it depends on too many factors and requires detailed analysis. – Nikolay Shmyrev May 15 '17 at 08:53
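As a rough illustration of the word hints mentioned in the comment above, assuming the v1 REST `speech:recognize` endpoint: the expected words go into `speechContexts[].phrases` in the request body. The `build_request` helper and the specific encoding/sample-rate values are made up for the example; `he-IL` is assumed to be the Hebrew language code:

```python
import json

def build_request(audio_base64: str, expected_words: list) -> str:
    """Build a speech:recognize request body (v1 REST API) that passes
    the expected words as recognition hints via speechContexts."""
    body = {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "he-IL",  # assumed code for Hebrew
            "speechContexts": [
                # Bias the recognizer toward the words of the text
                # the user is supposed to recite.
                {"phrases": expected_words}
            ],
        },
        "audio": {"content": audio_base64},
    }
    return json.dumps(body)
```

This body would then be POSTed to the recognize endpoint with your API credentials; the hints raise the likelihood of the listed words in the output but do not restrict recognition to them.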
- If I want to use CMU Sphinx, and I have a large database of recordings, will it take that long to make the acoustic and language models? Also, could you please tell me the specific steps I need to take? – amitairos May 15 '17 at 09:41
- The specific steps and data requirements are listed in the tutorial: https://cmusphinx.github.io/wiki/tutorial. A large database of recordings might work, but it has to be prepared in a special format. It still takes several months. – Nikolay Shmyrev May 15 '17 at 11:40