6

Can you explain what Cube mode and Cube data files are in the Tesseract OCR engine, and what the advantage of using them is?

And how can I train Tesseract for Greek to get better results?

George Melidis

4 Answers

6

For those who might still be interested: Tesseract's website provides standard trained data sets for different languages.

https://code.google.com/p/tesseract-ocr/downloads/list?num=100&start=100

The training procedure is described here (for version 3.01):

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Cube is a separate recognition engine that ships alongside the standard Tesseract engine. It consumes more resources and is slower, but gives better results.

Data files are the set of files produced during training that are finally merged into a single trained data file.
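As a rough illustration of that final merge step, here is a minimal Python sketch. It assumes the `combine_tessdata` tool from the Tesseract training tools is on your PATH and that the component files share a language prefix (for example `ell.unicharset`). The helper names `component_files` and `combine_command` are hypothetical, not part of Tesseract.

```python
from pathlib import Path


def component_files(training_dir, lang):
    """List the per-language component files (e.g. ell.unicharset) in training_dir."""
    prefix = f"{lang}."
    merged = f"{lang}.traineddata"
    return sorted(
        p.name
        for p in Path(training_dir).iterdir()
        # Skip the merged output itself if it already exists.
        if p.is_file() and p.name.startswith(prefix) and p.name != merged
    )


def combine_command(training_dir, lang):
    """Build the combine_tessdata call that merges those files into <lang>.traineddata."""
    # combine_tessdata expects the path prefix, including the trailing dot.
    return ["combine_tessdata", str(Path(training_dir) / f"{lang}.")]
```

Passing the result of `combine_command("training", "ell")` to `subprocess.run(..., check=True)` should leave an `ell.traineddata` file next to the components, provided all the files the tool requires are present.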

Naresh
3

There is an explanation of the various training files required by the Cube engine mode on the tesseract-ocr-extradocs project wiki:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube

There you can find detailed (but incomplete) information on how to create the necessary files for training in Cube mode. There's also some information on the neural network file format that might be useful:

https://code.google.com/p/tesseract-ocr-extradocs/wiki/nnFileFormat

Cube mode will often give you better recognition results by using neural networks instead of the adaptive classifier.

I have never created Cube training files myself, so I can't give you more detailed information on how to create them.

pvorb
2

For Tesseract 4+ (with LSTM)

I'm not completely sure about Cube mode, but with --oem 1 you can enable the new LSTM engine and use one of the following approaches:

  • Use the existing models

    I would recommend using the pre-trained models available in the Tesseract GitHub repo. They cover a wide variety of languages (and it looks like Greek is supported too!)

  • Train it yourself

    I haven't tried this myself, but the relevant wiki on GitHub looks solid.
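To make the `--oem 1` option concrete, here is a minimal Python sketch that builds and runs the command line for the LSTM engine with a pre-trained model. It assumes Tesseract 4+ is installed and on your PATH, and that the Greek model is named `ell`. The helper names `tesseract_command` and `run_ocr` are hypothetical.

```python
import subprocess


def tesseract_command(image, output_base, lang="ell", oem=1, tessdata_dir=None):
    """Build a Tesseract 4+ command line; --oem 1 selects the LSTM engine."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific directory of .traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image, output_base, **options):
    """Run Tesseract and return the path of the text file it writes."""
    subprocess.run(tesseract_command(image, output_base, **options), check=True)
    # By default Tesseract appends .txt to the output base name.
    return f"{output_base}.txt"
```

For example, `run_ocr("page.png", "page", tessdata_dir="tessdata")` would write the recognized text to `page.txt`, where `page.png` and `tessdata` are placeholder paths.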

TL;DR

  • git clone git@github.com:tesseract-ocr/tessdata.git
  • Select the language file you want (for Greek, ell.traineddata)
  • Move it into your project's tessdata directory
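The last two steps can be sketched in Python as follows. This is only an illustration: it assumes the repository has already been cloned, copies the file rather than moving it so the checkout stays intact, and uses a hypothetical helper name, `install_language`.

```python
import shutil
from pathlib import Path


def install_language(tessdata_checkout, lang, project_tessdata):
    """Copy <lang>.traineddata from a tessdata checkout into the project's tessdata directory."""
    source = Path(tessdata_checkout) / f"{lang}.traineddata"
    if not source.is_file():
        raise FileNotFoundError(f"no trained data for {lang!r} in {tessdata_checkout}")
    target_dir = Path(project_tessdata)
    # Create the project's tessdata directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 returns the path of the newly created file.
    return Path(shutil.copy2(source, target_dir))
```

Calling `install_language("tessdata", "ell", "myproject/tessdata")` would place the Greek model where the project can find it. Tesseract can then be pointed at that directory with `--tessdata-dir` or the `TESSDATA_PREFIX` environment variable.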
Pranav
0

As far as I know, PaddleOCR seems to be a better toolbox for training OCR models. The pre-trained models it provides also perform well in most scenarios. You can give it a try. :)

Quick start: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/quickstart_en.md

How to train a text detection model: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/detection_en.md

How to train a text recognition model: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/recognition_en.md

Gry