Tesseract5-OCR Train - Segmentation fault error

Question

I am trying to train Tesseract 5 on a new font. I am running Tesseract on WSL Ubuntu, and I followed the tutorial by Gabriel Garcia and the official Tesseract compilation docs. I am training on top of the eng.traineddata file from tessdata_best, which I placed in the tesseract/tessdata directory. I have also provided the training data (tif, box, and gt files) in the tesstrain/data/$(MODEL_NAME)-ground-truth directory.

When I run the train command

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Calibri START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

I get the following results:

user@DESKTOP:~/tesseract-ocr/tesstrain$ TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Calibri START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Calibri
Extracting tessdata components from ../tesseract/tessdata/eng.traineddata
Wrote data/eng/Calibri.lstm
Wrote data/eng/Calibri.lstm-punc-dawg
Wrote data/eng/Calibri.lstm-word-dawg
Wrote data/eng/Calibri.lstm-number-dawg
Wrote data/eng/Calibri.lstm-unicharset
Wrote data/eng/Calibri.lstm-recoder
Wrote data/eng/Calibri.version
Version:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
unicharset_extractor --output_unicharset "data/Calibri/my.unicharset" --norm_mode 2 "data/Calibri/all-gt"
Extracting unicharset from plain text file data/Calibri/all-gt
Other case É of é is not in unicharset
Wrote unicharset file data/Calibri/my.unicharset
merge_unicharsets data/eng/Calibri.lstm-unicharset data/Calibri/my.unicharset "data/Calibri/unicharset"
Loaded unicharset of size 112 from file data/eng/Calibri.lstm-unicharset
Loaded unicharset of size 112 from file data/Calibri/my.unicharset
Wrote unicharset file data/Calibri/unicharset.
python3 shuffle.py 0 "data/Calibri/all-lstmf"
/bin/bash: line 2: bc: command not found
/bin/bash: line 5: bc: command not found
+ head -n '' data/Calibri/all-lstmf
head: invalid number of lines: ''
+ tail -n '' data/Calibri/all-lstmf
tail: invalid number of lines: ''
+ '[' '' = Windows_NT ']'
if [ "" = "Windows_NT" ]; then \
        dos2unix "data/Calibri/Calibri.numbers"; \
        dos2unix "data/Calibri/Calibri.punc"; \
        dos2unix "data/Calibri/Calibri.wordlist"; \
        dos2unix "data/langdata/Calibri/Calibri.config"; \
fi
combine_lang_model \
  --input_unicharset data/Calibri/unicharset \
  --script_dir data/langdata \
  --numbers data/Calibri/Calibri.numbers \
  --puncs data/Calibri/Calibri.punc \
  --words data/Calibri/Calibri.wordlist \
  --output_dir data \
   \
  --lang Calibri
Failed to read data from: data/Calibri/Calibri.wordlist
Failed to read data from: data/Calibri/Calibri.punc
Failed to read data from: data/Calibri/Calibri.numbers
Loaded unicharset of size 112 from file data/Calibri/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 47 = ~
Config file is optional, continuing...
Failed to read data from: data/langdata/Calibri/Calibri.config
Null char=2
Created data/Calibri/Calibri.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/Calibri/Calibri.traineddata \
  --old_traineddata ../tesseract/tessdata/eng.traineddata \
  --continue_from data/eng/Calibri.lstm \
  --learning_rate 0.0001 \
  --model_output data/Calibri/checkpoints/Calibri \
  --train_listfile data/Calibri/list.train \
  --eval_listfile data/Calibri/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Failed to load list of training filenames from data/Calibri/list.train
make: *** [Makefile:324: data/Calibri/checkpoints/Calibri_checkpoint] Error 1

I tried manually adding a path to an lstm file in the list.train file. The error

Failed to load list of training filenames from data/Calibri/list.train

stopped, but when I ran the train command again I got this error instead:

user@DESKTOP:~/tesseract-ocr/tesstrain$ TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Calibri START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
lstmtraining \
  --debug_interval 0 \
  --traineddata data/Calibri/Calibri.traineddata \
  --old_traineddata ../tesseract/tessdata/eng.traineddata \
  --continue_from data/eng/Calibri.lstm \
  --learning_rate 0.0001 \
  --model_output data/Calibri/checkpoints/Calibri \
  --train_listfile data/Calibri/list.train \
  --eval_listfile data/Calibri/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Loaded file data/eng/Calibri.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 111!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx512:512, 1247232
  Fc111:111, 56943
Total weights = 1461007
Previous null char=110 mapped to 110
Continuing from data/eng/Calibri.lstm
make: *** [Makefile:324: data/Calibri/checkpoints/Calibri_checkpoint] Segmentation fault

I've searched the internet, and the closest thing I found was this issue opened on the Tesseract GitHub page. The problem in that issue was fixed by switching from a fast traineddata file to a best traineddata file, but that did not work for me.

Thanks in advance

When trying to solve problems I always find it best to start with the _FIRST_ problem and solve that, then do the next one etc. The reason is that often solving the first one may also allow other issues further down to be fixed as well. Starting with the _LAST_ problem doesn't work out so well. Here your **first** problem is the error `/bin/bash: line 2: bc: command not found` which means you haven't installed the `bc` program in your WSL instance. I recommend you do that first, then see what happens. — MadScientist, Mar 25 '23 at 20:00
@MadScientist Thanks for the insight, I don't know how I overlooked the error but let me do as you've said, hopefully, it will resolve some of my issues. — Algocoder, Mar 27 '23 at 19:04
@MadScientist Thank you so much for the help, I've been stuck on this problem for weeks now and all i needed to do was install bc package. Mad gratitudes to you. — Algocoder, Mar 27 '23 at 19:17
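
For anyone who hits the same wall, here is a minimal sketch of a pre-flight check to run before `make training`. It is not part of tesstrain. The helper names `missing_tools` and `empty_files` are made up for illustration, and the list of tools to check (`bc`, `python3`) is an assumption based only on the commands visible in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of tesstrain.

# Print each command that is not found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Print each file that does not exist or has zero length, one per line.
# Returns 1 if at least one such file is found, 0 otherwise.
empty_files() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

With these defined, `missing_tools bc python3` would have printed `bc` in the situation from the question, and on Ubuntu that is normally fixed with `sudo apt-get install bc`. Likewise, `empty_files data/Calibri/list.train data/Calibri/list.eval` flags list files that were written empty by the failed `head`/`tail` step. If they are empty or were edited by hand, it may be necessary to delete them so that make regenerates them once `bc` is available.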
