I have Tesseract running in Python via pytesseract.

Using an image of a newspaper article that contains around 600 words, the `pytesseract.image_to_string` function takes around 20 seconds to complete.

The results are great, but at that speed it is of little practical use.

The image has a file size of 3.5 MB and a resolution of 3024 × 4032 pixels (in case it is useful). It has already been preprocessed with OpenCV.

The run time of roughly 18 to 20 seconds is the same on my local machine and when the code is uploaded to Google Cloud Platform.

Is there anything that anyone can recommend to speed up this process?

The pytesseract version used is 0.2.5.

Ali AzG
user3795126
  • You can try using the tessdata from https://github.com/tesseract-ocr/tessdata_fast. That would decrease OCR quality but improve performance. You can also scale down your image, which has the same trade-off. – Dmitrii Z. Dec 08 '18 at 12:52
  • Does your image have lots of noise? This can slow down Tesseract a lot. – Lance Dec 08 '18 at 13:09
  • You should down-scale the image to around 300dpi. – user3169 Dec 08 '18 at 21:54
  • @DmitriiZ. Thanks for the comment. I tried scaling down, but the speed was still slow for anything that gave reasonable results. I was unable to figure out how to add the tesseract_fast library. – user3795126 Dec 10 '18 at 08:50
  • @LachlanLindsay An example of the image (after the preprocessing has been completed) can be found here: https://drive.google.com/file/d/1rJ_KoGcAmsAvbCdo6GfhQBbMJ1TJ2vDC/view?usp=sharing – user3795126 Dec 10 '18 at 08:52
  • tessdata_fast is not a library; it is the trained data (tessdata) that Tesseract uses for recognition. Find your tessdata folder and replace your `eng.traineddata` (if you are using English) with the one from the tessdata_fast repo. As a side note, I would expect a 3k × 4k image without any preprocessing to take around 18-20 seconds. You can try detecting the text regions in your image to reduce the amount of input for Tesseract. – Dmitrii Z. Dec 10 '18 at 08:53
  • @user3169 It would appear that the image is currently at 72 pixels per inch. An example can be found here: https://drive.google.com/file/d/1rJ_KoGcAmsAvbCdo6GfhQBbMJ1TJ2vDC/view?usp=sharing – user3795126 Dec 10 '18 at 08:53
  • Also, considering your example image, removing the black border would improve performance. – Dmitrii Z. Dec 10 '18 at 08:54
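
The down-scaling suggested in the comments could be sketched roughly as follows. This is a minimal sketch, not a tested fix: the helper names `scaled_size` and `ocr_downscaled` and the 2000-pixel cap are assumptions chosen for illustration, and the right cap depends on how small the newspaper text becomes after shrinking. It uses Pillow, which pytesseract already depends on.

```python
def scaled_size(width, height, max_side=2000):
    """Return (new_width, new_height) with the longer side capped at max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    # Scale both sides by the same ratio to keep the aspect ratio.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


def ocr_downscaled(path, max_side=2000):
    """Shrink the image at `path`, then run Tesseract on the smaller copy."""
    # Imported here so scaled_size() can be used without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        new_size = scaled_size(image.width, image.height, max_side)
        if new_size != image.size:
            # LANCZOS is a high-quality filter for shrinking text images.
            image = image.resize(new_size, Image.LANCZOS)
        return pytesseract.image_to_string(image)
```

For the 3024 × 4032 image in the question, `scaled_size(3024, 4032)` gives 1500 × 2000, roughly a quarter of the original pixel count. Whether the OCR quality stays acceptable at that size has to be checked on the actual image.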

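As an alternative to overwriting the installed `eng.traineddata`, Tesseract can be pointed at a separate folder that holds the tessdata_fast model. The following is a sketch under assumptions: `/opt/tessdata_fast` is a hypothetical directory where `eng.traineddata` from the tessdata_fast repository has been saved, and `fast_config` and `ocr_fast` are illustrative names. The `--tessdata-dir` option is handed to Tesseract through pytesseract's `config` argument.

```python
def fast_config(tessdata_dir):
    """Build the Tesseract option string that selects a tessdata folder."""
    # Quote the path so directories containing spaces survive argument splitting.
    return '--tessdata-dir "{}"'.format(tessdata_dir)


def ocr_fast(path, tessdata_dir="/opt/tessdata_fast"):
    """Run OCR on `path` with the model found in `tessdata_dir` (hypothetical path)."""
    # Imported here so fast_config() can be used without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(
            image, lang="eng", config=fast_config(tessdata_dir)
        )
```

Keeping the fast model in its own folder makes it easy to time both models on the same image and compare the output before committing to one.
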
0 Answers