Custom Dictionary for Tesseract

Question

I am currently working on a project for Android using Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to the Tesseract OCR wiki, the best way to go about this would be to:

Replace tessdata/eng.user-words with your own word list, in the same format - UTF8 text, one word per line.

However, there is no eng.user-words file in the tessdata folder, and I assume that if I just create a text file with my dictionary in it, it will never be used.

Has anybody had a similar experience and knows what to do?

score 13 · Accepted Answer · edited Sep 22 '20 at 09:11

If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file.

I intended to replace the word-dawg file completely to try to get better results (i.e. the words I'm detecting are always the same).

You'll need the combine_tessdata and wordlist2dawg executables, which end up in the training directory when you compile Tesseract.

1. Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later). The second argument is the output prefix, so the unicharset ends up at traineddat_backup/.unicharset:

   ./combine_tessdata -u eng.traineddata traineddat_backup/.

2. Create a text file of your word list (wordlistfile).

3. Create an eng.word-dawg:

   ./wordlist2dawg wordlistfile eng.word-dawg traineddat_backup/.unicharset

4. Replace the word-dawg file:

   ./combine_tessdata -o eng.traineddata eng.word-dawg

That should be it.
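
For anyone who wants to repeat the procedure, the steps above could be wrapped in a small shell function along these lines. This is only a sketch: `rebuild_dictionary`, `run` and the `DRY_RUN` variable are names made up for illustration, and it assumes the Tesseract 3.x `combine_tessdata` and `wordlist2dawg` binaries sit in the current directory and that the prefix contains a directory part, as in `traineddat_backup/.`.

```shell
#!/bin/sh
# Sketch only: rebuild_dictionary and run are hypothetical helpers, not Tesseract tools.
# Assumes ./combine_tessdata and ./wordlist2dawg (Tesseract 3.x) are in the current directory.

# Print the command; also execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# usage: rebuild_dictionary LANG WORDLIST BACKUP_PREFIX
rebuild_dictionary() {
    lang=$1
    wordlist=$2
    prefix=$3
    # Make sure the backup directory exists before unpacking into it.
    run mkdir -p "$(dirname "$prefix")" || return 1
    # Unpack every component; files are written as PREFIX + component name.
    run ./combine_tessdata -u "$lang.traineddata" "$prefix" || return 1
    # Compile the word list against the unpacked unicharset.
    run ./wordlist2dawg "$wordlist" "$lang.word-dawg" "${prefix}unicharset" || return 1
    # Overwrite the word-dawg component inside the traineddata file.
    run ./combine_tessdata -o "$lang.traineddata" "$lang.word-dawg"
}
```

Setting `DRY_RUN=1` before calling `rebuild_dictionary eng wordlistfile traineddat_backup/.` only prints the four commands, which is a cheap way to check the paths before touching eng.traineddata.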

edited Sep 22 '20 at 09:11

Sabito stands with Ukraine

answered Nov 26 '12 at 00:01

roocell

I am trying to execute step 3 but I get this error: `Loading unicharset from 'traineddat_backup/.unicharset' Failed to load unicharset from 'traineddat_backup/.unicharset'`. Kindly help; I am doing this on Ubuntu 12.04 with Tesseract 3.02. – Muaz Usmani Dec 24 '13 at 20:23

@MuhammadMuaz: `traineddat_backup/.unicharset` is a path inside the output folder of the 1st command. If the first command was `./combine_tessdata -u ita.traineddata /path/to/folder/tmp/ita.`, the 3rd is `./wordlist2dawg wordlist ita.word-dawg /path/to/folder/tmp/ita.unicharset`. Hope it helps; I wasted 30 minutes on that. – Tenaciousd93 Dec 04 '14 at 11:04
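
The prefix rule described in this comment can be checked before running wordlist2dawg. The sketch below assumes, as the comment's example suggests, that `combine_tessdata -u` writes each component as the prefix followed by the component name; `unicharset_path` and `check_unicharset` are hypothetical helper names, not part of Tesseract.

```shell
# Sketch only: unicharset_path and check_unicharset are hypothetical helpers.

# The unpacked unicharset is assumed to live at PREFIX + "unicharset", e.g.
# prefix "/path/to/folder/tmp/ita." gives "/path/to/folder/tmp/ita.unicharset".
unicharset_path() {
    printf '%s%s\n' "$1" "unicharset"
}

# Return 0 if the unpacked unicharset exists; otherwise say where it was expected.
check_unicharset() {
    path=$(unicharset_path "$1")
    if [ -f "$path" ]; then
        echo "found $path"
    else
        echo "missing $path: pass the same prefix that was given to combine_tessdata -u" >&2
        return 1
    fi
}
```

Calling `check_unicharset /path/to/folder/tmp/ita.` right after the unpack step should catch a mismatched prefix before wordlist2dawg reports "Failed to load unicharset".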
