
When processing TIFF files of 600–700 pages with the Tesseract OCR engine using the hocr option, we have observed that each file takes around 40–50 minutes.

That seems like a very long time, even for large files.

Is there any way to speed up the process?

We are using the following command:

<Drive>:\Tesseract-OCR>tesseract.exe "Source_Tiff_File" "Destination_File" hocr
  • In the svn there is a new contribution from AMD which uses OpenGL and seems to bring a speed improvement of 50% or more. – tobltobs Apr 21 '15 at 08:10
  • Hi, out of curiosity how many `mb` is the 600-700 page `tif`? maybe you could try some of the new GPU instances from `aws` or `azure` and leverage the `OpenCL` and `Cuda` patches mentioned above (Not `OpenGL`). I'm hoping to try this for one of my projects also. – joefromct Apr 24 '16 at 15:25

1 Answer


You can split up the multi-page TIFF and run them in multiple processes.
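This split-and-parallelize approach could be sketched as follows, assuming the multi-page TIFF has already been split into single-page files (e.g. with `tiffsplit` from libtiff) named `page_aaa.tif`, `page_aab.tif`, and so on. All file names here are hypothetical, and `tesseract` is assumed to be on the `PATH`.

```python
# Sketch: OCR single-page TIFFs in parallel, one tesseract process
# per worker. File names and directory layout are assumptions.
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

def build_command(page_tiff):
    """Build the tesseract hOCR command line for one single-page TIFF."""
    out_base = page_tiff.rsplit(".", 1)[0]  # tesseract adds its own extension
    return ["tesseract", page_tiff, out_base, "hocr"]

def ocr_page(page_tiff):
    # One tesseract process per page; the number of CPU cores is the
    # practical limit on useful parallelism.
    subprocess.run(build_command(page_tiff), check=True)

def ocr_in_parallel(page_dir, workers=4):
    pages = sorted(glob.glob(page_dir + "/page_*.tif"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(ocr_page, pages))
```

With 4–8 workers on a multi-core machine, wall-clock time should drop roughly in proportion to the number of cores, since Tesseract itself processes pages serially.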

nguyenq
  • Because we read the text and coordinates from the HTML file, I would then have to merge the data and coordinates from all the HTML files and consolidate them into one, which adds extra processing overhead. – Worldprogram Apr 19 '15 at 05:16
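The merge step the comment describes can be fairly cheap, since hOCR bounding boxes are relative to their own page: it is mostly a matter of collecting each per-page `ocr_page` div and renumbering its id. A minimal regex-based sketch, assuming each per-page output contains exactly one `<div class='ocr_page'>` block as Tesseract emits (a real HTML parser would be safer in production, and the wrapper markup below is simplified, not Tesseract's full header):

```python
# Sketch: merge per-page hOCR files into one document by renumbering
# the page ids, so each bbox stays tied to its own page.
import re

def extract_page_div(hocr_text, page_number):
    """Pull the ocr_page div out of one hOCR file and renumber its id."""
    div = re.search(r"<div class=.ocr_page.*</div>", hocr_text, re.S).group(0)
    return re.sub(r"id=(['\"])page_\d+\1",
                  "id='page_%d'" % page_number, div, count=1)

def merge_hocr(per_page_texts):
    """Concatenate renumbered page divs into a single (simplified) hOCR body."""
    divs = [extract_page_div(t, i + 1) for i, t in enumerate(per_page_texts)]
    return "<html><body>\n%s\n</body></html>" % "\n".join(divs)
```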