Tesseract Ocr Engine Cube mode - Training Tesseract

Question

Can you explain what Cube mode and Cube data files are in the Tesseract OCR engine, and what the advantage of using them is?

And how can I train Tesseract for Greek to get better results?

score 6 · Answer 1 · edited Dec 10 '18 at 10:04

For those who might still be interested: Tesseract's website offers standard trained data sets for different languages.

https://code.google.com/p/tesseract-ocr/downloads/list?num=100&start=100

The training procedure is described here (for version 3.01):

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Cube is a separate recognition engine that can be used in place of the standard Tesseract one. It consumes more resources and is slower, but gives better results.

Cube data files are a set of files that are ultimately merged into a single trained data file.

score 3 · Answer 2 · answered May 19 '14 at 09:49

There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube

There you can find detailed (but incomplete) information on how to create the necessary files for training in Cube mode. There's also some information on the neural network file format that might be useful:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/nnFileFormat

Cube mode will often give you better recognition results by using neural networks instead of the adaptive classifier.

I have never created Cube training files myself, so I can't give you more detailed information on how to create them.

Pranav · Answer 3 · 2018-06-20T15:08:27.607

For Tesseract 4+ (with LSTM)

I'm not completely sure about cube mode but with --oem 1 you can enable the new LSTM engine and take advantage of the following solutions:

Use the existing models

I would recommend using the pre-trained models available on the Tesseract GitHub repo. They've got a wide variety of languages (and it looks like Greek is supported too!)
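As a rough sketch of how a pre-trained model and --oem 1 fit together, the Python snippet below assembles a tesseract command line. The helper names build_tesseract_cmd and run_ocr, the file names, and the default of ell (the code the tessdata files use for modern Greek) are assumptions made for illustration; actually running it requires a Tesseract 4+ binary on the PATH.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="ell", oem=1, tessdata_dir=None):
    """Return the argument list for a Tesseract 4+ run (hypothetical helper)."""
    # --oem 1 selects the LSTM engine; -l picks the language model to load.
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific folder containing the .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, output_base, **kwargs):
    """Run Tesseract; the recognized text is written to <output_base>.txt."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, **kwargs), check=True)
```

For example, run_ocr("page.png", "page", tessdata_dir="./tessdata") would be expected to write the recognized text to page.txt, provided ell.traineddata is present in that folder.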

Train it yourself

I haven't tried this myself but the relevant Wiki on GitHub looks solid.

tl;dr

git clone git@github.com:tesseract-ocr/tessdata.git
Select the language file you want
Move it into your project's tessdata directory
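The last two steps could be scripted roughly as follows. This is a minimal sketch, assuming the repository has already been cloned locally and that Greek (ell.traineddata) is the language you want; the function name install_traineddata and the directory paths are hypothetical.

```python
import shutil
from pathlib import Path


def install_traineddata(repo_dir, lang, project_tessdata_dir):
    """Copy <lang>.traineddata from a cloned tessdata repo into a project's tessdata directory."""
    src = Path(repo_dir) / f"{lang}.traineddata"
    if not src.is_file():
        raise FileNotFoundError(f"no trained data for '{lang}' in {repo_dir}")
    dest_dir = Path(project_tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Copy rather than move, so the clone stays intact (a small deviation from the step above).
    return Path(shutil.copy2(src, dest_dir / src.name))
```

A call such as install_traineddata("tessdata", "ell", "my_project/tessdata") returns the path of the copied file, or raises FileNotFoundError if the clone has no file for that language code.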

This is not OP's answer. – Andrew Ravus, Apr 26 '19 at 07:53

score 0 · Answer 4 · answered Aug 16 '22 at 17:47

As far as I know, PaddleOCR seems to be a better toolbox for training OCR models. The pre-trained models it provides also perform well in most scenarios. You can give it a try. :)

Quick start: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/quickstart_en.md

How to train text detection model: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/detection_en.md

How to train text recognition model: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/recognition_en.md
