I am trying to build a customised scorer (language model) for speech-to-text using DeepSpeech in Colab. When I call generate_lm.py, I get this error:

```
    main()
  File "generate_lm.py", line 201, in main
    build_lm(args, data_lower, vocab_str)
  File "generate_lm.py", line 126, in build_lm
    binary_path,
  File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/content/DeepSpeech/native_client/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/content/DeepSpeech/data/lm/lm_filtered.arpa', '/content/DeepSpeech/data/lm/lm.binary']' died with <Signals.SIGSEGV: 11>.
```

I am calling generate_lm.py like this:

```
! python3 generate_lm.py --input_txt hindi_tokens.txt --output_dir /content/DeepSpeech/data/lm --top_k 500000 --kenlm_bins /content/DeepSpeech/native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
```
Anjaly Vijayan
  • Some questions to clarify your environment: - How have you installed the `KenLM` binaries? - Are they pre-built? What's occurring here is a `SIGSEGV` - or segmentation violation. This means that the program is trying to access a location in memory that it's not authorised to use. When this occurs with `KenLM` - the software used by `generate_lm.py` - it's usually because the wrong binary is being used. This is explained in more detail in the DeepSpeech PlayBook - https://mozilla.github.io/deepspeech-playbook/SCORER.html – Kathy Reid Nov 20 '21 at 23:56
  • It's not pre-built. I found a kenlm folder in DeepSpeech/native_client, but there was no build folder in it, so I cloned the kenlm git repo and built it myself. – Anjaly Vijayan Nov 21 '21 at 09:42
  • Did you install with `pip install https://github.com/kpu/kenlm/archive/master.zip` ? – Kathy Reid Nov 21 '21 at 11:30
  • Not with pip. I cloned the kenlm git repo and followed the instructions given there, running these in order: `mkdir -p build`, `cd build`, `cmake ..`, `make -j 4` – Anjaly Vijayan Nov 21 '21 at 12:55
  • OK, that should still work. Is there any way you can do a stack trace from Colab? – Kathy Reid Nov 21 '21 at 13:02
  • I am not sure about that @KathyReid – Anjaly Vijayan Nov 23 '21 at 02:46

1 Answer

I was able to find a solution. The language model was created successfully after I reduced top_k to 15000. My phrases file has only about 42,000 entries, so the original value of 500000 was far larger than the collection. The top_k value has to be adjusted to the size of the collection: the parameter keeps only that many of the most frequent words in the vocabulary, and everything less frequent is removed before processing.
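
To pick a sensible top_k, it can help to count the distinct words in the input file first. The following is a minimal sketch, not part of DeepSpeech: the helper names `count_vocab` and `suggest_top_k` are made up for illustration, and it assumes tokens are lowercased and split on whitespace, which may not match exactly how generate_lm.py builds its vocabulary.

```python
from collections import Counter
import io


def count_vocab(lines):
    """Count lowercased, whitespace-separated tokens over an iterable of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return counter


def suggest_top_k(counter, requested):
    """Clamp the requested top_k to the number of distinct words seen."""
    return min(requested, len(counter))


# Tiny in-memory stand-in for the real corpus. For an actual file, something like:
#   with open("hindi_tokens.txt", encoding="utf-8") as f:
#       counts = count_vocab(f)
sample = io.StringIO("नमस्ते दुनिया\nनमस्ते भारत\n")
counts = count_vocab(sample)

print("distinct words:", len(counts))
print("top_k to try:", suggest_top_k(counts, 500000))
```

The clamped value is only an upper bound to start from. In my case a value well below the size of the collection (15000) is what worked.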

Anjaly Vijayan