
When processing TIFF files of 600–700 pages with the Tesseract OCR engine using the hocr option, we have observed that each file takes around 40–50 minutes.

That seems like a very long time, even for large files.

Is there any way to speed up the process?

We are using the following command:

<Drive>:\Tesseract-OCR>tesseract.exe "Source_Tiff_File" "Destination_File" hocr
  • In the svn there is a new contribution from AMD which uses OpenGL and seems to bring a speed improvement of 50% or more. – tobltobs Apr 21 '15 at 08:10
  • Hi, out of curiosity how many `mb` is the 600-700 page `tif`? maybe you could try some of the new GPU instances from `aws` or `azure` and leverage the `OpenCL` and `Cuda` patches mentioned above (Not `OpenGL`). I'm hoping to try this for one of my projects also. – joefromct Apr 24 '16 at 15:25

1 Answer


You can split up the multi-page TIFF and run them in multiple processes.
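This split-and-parallelize approach could be sketched as follows, assuming the multi-page TIFF has already been split into single-page files (e.g. with `tiffsplit` from libtiff) named `page_aaa.tif`, `page_aab.tif`, and so on. All file names here are hypothetical, and `tesseract` is assumed to be on the `PATH`.

```python
# Sketch: OCR single-page TIFFs in parallel, one tesseract process
# per worker. File names and directory layout are assumptions.
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

def build_command(page_tiff):
    """Build the tesseract hOCR command line for one single-page TIFF."""
    out_base = page_tiff.rsplit(".", 1)[0]  # tesseract adds its own extension
    return ["tesseract", page_tiff, out_base, "hocr"]

def ocr_page(page_tiff):
    # One tesseract process per page; the number of CPU cores is the
    # practical limit on useful parallelism.
    subprocess.run(build_command(page_tiff), check=True)

def ocr_in_parallel(page_dir, workers=4):
    pages = sorted(glob.glob(page_dir + "/page_*.tif"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(ocr_page, pages))
```

With 4–8 workers on a multi-core machine, wall-clock time should drop roughly in proportion to the number of cores, since Tesseract itself processes pages serially.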

nguyenq
  • Because we read the text and coordinates from the HTML file, I would then have to merge the data and coordinates from all the HTML files and consolidate them into one, which adds extra processing overhead. – Worldprogram Apr 19 '15 at 05:16
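The merge step the comment describes can be fairly cheap, since hOCR bounding boxes are relative to their own page: it is mostly a matter of collecting each per-page `ocr_page` div and renumbering its id. A minimal regex-based sketch, assuming each per-page output contains exactly one `<div class='ocr_page'>` block as Tesseract emits (a real HTML parser would be safer in production, and the wrapper markup below is simplified, not Tesseract's full header):

```python
# Sketch: merge per-page hOCR files into one document by renumbering
# the page ids, so each bbox stays tied to its own page.
import re

def extract_page_div(hocr_text, page_number):
    """Pull the ocr_page div out of one hOCR file and renumber its id."""
    div = re.search(r"<div class=.ocr_page.*</div>", hocr_text, re.S).group(0)
    return re.sub(r"id=(['\"])page_\d+\1",
                  "id='page_%d'" % page_number, div, count=1)

def merge_hocr(per_page_texts):
    """Concatenate renumbered page divs into a single (simplified) hOCR body."""
    divs = [extract_page_div(t, i + 1) for i, t in enumerate(per_page_texts)]
    return "<html><body>\n%s\n</body></html>" % "\n".join(divs)
```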