I'm using Tesseract to do OCR on millions of PDFs, and I'm trying to squeeze out as much performance as I can.
My current pipeline uses `convert` to turn each PDF into PNG files (one per page), and then runs Tesseract on each of those.
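For context, the disk-based pipeline looks roughly like the sketch below. This is an illustration rather than my exact code: the helper names, the 300 DPI density and the `page-%04d.png` naming are placeholders, and it assumes `convert` is ImageMagick's and that both tools are on the PATH.

```python
import subprocess
from pathlib import Path


def convert_command(pdf_path, out_dir, density=300):
    """Build the ImageMagick command that writes one PNG per page."""
    # %04d is expanded by convert into the page index (assumed naming scheme).
    pattern = str(Path(out_dir) / "page-%04d.png")
    return ["convert", "-density", str(density), str(pdf_path), pattern]


def tesseract_command(png_path):
    """Build the Tesseract command for a single page image.

    The second argument is the output base; Tesseract appends ".txt" itself.
    """
    png_path = Path(png_path)
    return ["tesseract", str(png_path), str(png_path.with_suffix(""))]


def ocr_pdf(pdf_path, out_dir):
    """Disk-based pipeline: PDF -> PNG files on disk -> one text file per page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: every page is written to disk as a PNG.
    subprocess.run(convert_command(pdf_path, out_dir), check=True)
    # Step 2: every PNG is read back from disk by a separate Tesseract process.
    for png in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(png), check=True)
    return sorted(out_dir.glob("page-*.txt"))
```

Each page is written once by `convert` and read again by Tesseract, which is the disk round trip I want to eliminate.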
During profiling, I've discovered that a lot of time is spent writing files to disk, then reading them again, so I'd like to move all of this into memory.
I've got the PDF-to-PNG conversion working in memory, so now I need a way to pass the in-memory blob to Tesseract instead of giving it a path to a file. I haven't been able to find any documentation or examples of this. Is there a way to do it?