I just created a language model from a short text file. I did this for both English and Dutch, primarily to reduce recognition time by narrowing the possibilities. I created both with the Sphinx toolkit and the sphinxbase LM-to-binary converter. The Dutch language model can be found here: http://pastebin.com/txkxiAc6 The English one can be found here: http://pastebin.com/fr3Epj5b They are both small, but the English one recognizes everything it needs to recognize.

The Dutch one uses the Dutch Voxforge pack and dictionary. The English one uses cmusphinx-en-us-8khz-5.2.tar.gz and the default dictionary from pocketsphinx.

The code is like this:

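import java.io.InputStream;
import java.io.OutputStream;
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.Context;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.TimeFrame;

// Shared state used by both methods below.
static String language = "Dutch"; // or "English"
static Configuration configuration;
static Context context;
static Recognizer recognizer;

public static void main(String[] args) throws Exception {
     configuration = new Configuration();
     configuration.setAcousticModelPath("src/main/resources/" + language + "/model");
     configuration.setDictionaryPath("src/main/resources/" + language + "/dict.dict");
     configuration.setLanguageModelPath("src/main/resources/" + language + "/model.lm.bin");
     context = new Context(configuration);
     recognizer = context.getInstance(Recognizer.class);
     recognizer.allocate();

     // ----------GET INPUT STREAM AND SEND TO METHOD-------------

     RecognizeText(inputstream, outputstream);
}

// Returns the hypothesis for the first utterance in the stream, or "" if nothing is recognized.
private static String RecognizeText(InputStream stream, OutputStream os) throws Exception {
     context.setSpeechSource(stream, TimeFrame.INFINITE);
     Result result = recognizer.recognize();
     if (result != null) {
         SpeechResult speechResult = new SpeechResult(result);
         return speechResult.getHypothesis();
     }
     return "";
}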

The 'language' variable can be set to Dutch or English for the correct language. English works, but Dutch doesn't. Where is my error? I can't seem to find it.
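
One way to rule out a simple path mistake is to build the three resource paths from the language name in a single place and print them before creating the recognizer. The following is a minimal sketch, assuming the directory layout used in the code above; the class `ModelPaths` and its method names are hypothetical helpers, not part of Sphinx4.

```java
// Hypothetical helper: builds the resource paths used in the question's code.
final class ModelPaths {
    // Root directory taken from the paths in the question.
    private static final String ROOT = "src/main/resources";

    static String acousticModel(String language) {
        return ROOT + "/" + language + "/model";
    }

    static String dictionary(String language) {
        return ROOT + "/" + language + "/dict.dict";
    }

    static String languageModel(String language) {
        return ROOT + "/" + language + "/model.lm.bin";
    }

    public static void main(String[] args) {
        String language = args.length > 0 ? args[0] : "Dutch";
        // Print the resolved paths so a typo or wrong folder name is easy to spot.
        System.out.println(acousticModel(language));
        System.out.println(dictionary(language));
        System.out.println(languageModel(language));
    }
}
```

The returned strings can be passed directly to `setAcousticModelPath`, `setDictionaryPath`, and `setLanguageModelPath`.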

The Dutch Acoustic Model folder contains the following:

feat.params
mdef
means
mixture_weights
noisedict
transition_matrices
variances
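
To confirm that the folder really contains these files at run time, a small check like the one below could be run before allocating the recognizer. It is only a sketch: the file names are copied from the listing above, the default directory is an assumption based on the paths in the question, and the class `ModelDirCheck` is hypothetical. Other acoustic models may ship a different set of files.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: reports which of the listed model files are absent.
final class ModelDirCheck {
    // File names copied from the listing in the question.
    static final List<String> EXPECTED = Arrays.asList(
            "feat.params", "mdef", "means", "mixture_weights",
            "noisedict", "transition_matrices", "variances");

    // Returns the expected file names that are not regular files in modelDir.
    static List<String> missingFiles(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default path is an assumption based on the question's layout.
        Path dir = Paths.get(args.length > 0 ? args[0] : "src/main/resources/Dutch/model");
        System.out.println("Missing files in " + dir + ": " + missingFiles(dir));
    }
}
```

An empty list means all of the listed files were found; it does not say anything about whether their contents are valid.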
peter

1 Answer

The Dutch model was very old; it had not been updated for 5 years. I've just uploaded a new model to the CMUSphinx website.

https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Dutch/

It should be more accurate, but it is still trained on only 13 hours of data. English models are trained on 1000+ hours. We need more transcribed Dutch data.

Nikolay Shmyrev
  • The [Spoken Wikipedia Corpora](https://nats.gitlab.io/swc/) provide more transcribed Dutch data. 224 hours, according to their accepted manuscript. – Emiel Apr 17 '18 at 10:48
  • Well, if you want an already working model, there is this project: https://github.com/opensource-spraakherkenning-nl/Kaldi_NL – Nikolay Shmyrev Apr 19 '18 at 02:33
  • Yes, I know. I was just trying to help out others who may want to train their own model. – Emiel Apr 20 '18 at 06:55