Process a large volume of PDFs with OCR

Question

I need to convert a large number of multipage PDFs (around 23,000 documents, averaging 30 pages each) to text. Since the documents are typewritten and scanned, I want to use OCR to avoid character-recognition mistakes. The problem is that the estimated running time in R (using the tesseract package) is prohibitively long. Is there an online service provider that can be used for this task?

N.B. I had a look at both Amazon Web Services and Google Cloud, but it is extremely difficult for me to understand how to use them, especially how to automate the whole process.

You can use the [AWS Service Textract](https://aws.amazon.com/textract/) for that, and in GCP you can apparently use [Google Vision](https://cloud.google.com/vision/docs/ocr) for OCR, but a quick Google search would have told you that. — Maurice, Mar 04 '21 at 10:06
Yes, but the how-to guides are quite confusing for somebody like me with no prior experience. Do you know of any resources that explain in a simple way how to use these services? — umbecdl, Mar 04 '21 at 10:16
If you are considering the Google Vision API, this [thread](https://stackoverflow.com/a/49702469/9928809) gives more context on a typical OCR workflow, and processing multiple files is covered [here](https://stackoverflow.com/a/51881100/9928809). Does it match your use case? — Nick_Kh, Mar 08 '21 at 09:33
One other option is the [LEADTOOLS Text Extraction Cloud Services](https://services.leadtools.com/documentation/api-reference/extracttext). OCR can be done through a POST request with the configured parameters as shown in the link. As a disclaimer, I work for the vendor of these Cloud Services. If you decide to try them and need assistance, there's free technical support through email and chat. — Amin Dodin, Mar 11 '21 at 16:23
You could also consider running the OCR in parallel with the R package doParallel to use all the CPU cores of your computer. With 8 cores, I have been able to process more than 23,000 documents of more than 30 pages each in a reasonable amount of time. — Emmanuel Hamel, Sep 16 '22 at 23:33
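
The parallel approach suggested in the last comment could look roughly like the sketch below, which spreads whole documents across worker processes with doParallel and foreach. This is only an illustration under assumptions: the helper names `ocr_one_pdf` and `ocr_many_pdfs`, the 300 dpi setting, and the choice of `pdftools::pdf_ocr_text()` for the per-document OCR are hypothetical and are not the commenter's actual setup. It assumes the doParallel, foreach, pdftools and tesseract packages are installed.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: OCR one scanned PDF and return its text as one string.
# pdftools::pdf_ocr_text() renders each page and runs it through tesseract.
ocr_one_pdf <- function(path, dpi = 300) {
  pages <- pdftools::pdf_ocr_text(path, dpi = dpi, language = "eng")
  paste(pages, collapse = "\n")
}

# Hypothetical helper: apply `worker` to every path, one document per task.
# Returns a list named by path; a document that fails yields NA instead of
# aborting the whole run.
ocr_many_pdfs <- function(paths, worker = ocr_one_pdf,
                          n_workers = max(1, parallel::detectCores() - 1,
                                          na.rm = TRUE)) {
  cl <- parallel::makeCluster(n_workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always release the workers
  registerDoParallel(cl)

  results <- foreach(p = paths) %dopar% {
    tryCatch(worker(p), error = function(e) NA_character_)
  }
  setNames(results, paths)
}
```

A call might then be `texts <- ocr_many_pdfs(list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE), n_workers = 8)`, where the `pdfs` directory is a placeholder. With tens of thousands of documents it is probably safer to have each worker write its text to a file rather than keep every result in memory, so that an interrupted run can be resumed.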
