I need to convert a large number of multipage PDFs (around 23,000 documents, averaging 30 pages each) into text. Since the documents are typewritten and scanned, I want to use OCR to avoid character recognition mistakes. The problem is that the estimated running time in R (using the tesseract package) is impractically long. Is there an online service provider that can be used for this task?

N.B. I had a look at both Amazon Web Services and Google Cloud, but it is extremely difficult for me to understand how to use them, especially how to automate the whole process.

umbecdl
  • You can use the [AWS Service Textract](https://aws.amazon.com/textract/) for that, and in GCP you can apparently use [Google Vision](https://cloud.google.com/vision/docs/ocr) for OCR, but a quick Google search would have told you that. – Maurice Mar 04 '21 at 10:06
  • Yes, but the how-to guides are quite confusing for somebody like me with no prior experience. Do you know of any resources that explain in a simple way how to use them? – umbecdl Mar 04 '21 at 10:16
  • If you are considering the Google Vision API, this [thread](https://stackoverflow.com/a/49702469/9928809) gives more context on a typical OCR workflow, and processing multiple files is covered [here](https://stackoverflow.com/a/51881100/9928809). Does that match your use case? – Nick_Kh Mar 08 '21 at 09:33
  • One other option is the [LEADTOOLS Text Extraction Cloud Services](https://services.leadtools.com/documentation/api-reference/extracttext). OCR can be done through a POST request with the configured parameters as shown in the link. As a disclaimer, I work for the vendor of these Cloud Services. If you decide to try them and need assistance, there's free technical support through email and chat. – Amin Dodin Mar 11 '21 at 16:23
  • You could also consider running the OCR in parallel with the R package doParallel to use all the CPU cores of your computer. With 8 cores, I have been able to process more than 23,000 documents of more than 30 pages each in a reasonable amount of time. – Emmanuel Hamel Sep 16 '22 at 23:33
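
The parallel approach suggested in the last comment could look roughly like the sketch below. It is written in Python rather than R, and it uses a thread pool instead of doParallel, so it illustrates the idea and is not the commenter's method. It assumes the `pdf2image` and `pytesseract` packages plus the Tesseract and Poppler binaries are installed; the names `ocr_pdf`, `process_one` and `process_all` are hypothetical. Threads are used on the assumption that the heavy work happens in the external Tesseract and Poppler processes, so several documents can be processed at once. Finished files are skipped on a rerun, which matters for a job of roughly 23,000 × 30 = 690,000 pages.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_pdf(pdf_path):
    """OCR every page of one scanned PDF and return the text.

    Assumption: pdf2image (needs Poppler) and pytesseract (needs the
    Tesseract binary) are installed. They are imported here so the rest
    of the sketch can be exercised without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path), dpi=300)  # one image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def process_one(pdf_path, out_dir, ocr=ocr_pdf):
    """Write <out_dir>/<name>.txt for one PDF, skipping finished files."""
    out_file = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    if not out_file.exists():  # makes an interrupted batch resumable
        out_file.write_text(ocr(pdf_path), encoding="utf-8")
    return str(out_file)


def process_all(pdf_paths, out_dir, workers=8, ocr=ocr_pdf):
    """Process many PDFs concurrently; results keep the input order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_one, p, out_dir, ocr) for p in pdf_paths]
        return [f.result() for f in futures]
```

A call such as `process_all(sorted(Path("scans").glob("*.pdf")), "text")` would then write one `.txt` file per PDF, where `scans` and `text` are placeholder directory names. The `workers` value of 8 only mirrors the core count mentioned in the comment and should be adjusted to the machine.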
