
I'm using Tesseract to do OCR on millions of PDFs, and I'm trying to squeeze out as much performance as I can.

My current pipeline uses `convert` to turn each PDF into PNG files (one per page), and then runs Tesseract on each of those.

During profiling, I've discovered that a lot of time is spent writing files to disk, then reading them again, so I'd like to move all of this into memory.

I've got the PDF to PNG conversion working in memory, so now I need a way to pass the in-memory blob to Tesseract instead of giving it a path to a file. Is that possible? I haven't been able to find any documentation or examples of this.

mlissner
  • If you don't get a full answer to this question, a work-around is to save the image file to a RAM disk. (Many Linux distributions now create RAM disks by default.) – John1024 Aug 23 '16 at 21:14
  • Yeah, that was my instinct too, but we don't have that and it's outside my stack to make that kind of change. – mlissner Aug 23 '16 at 21:16
  • `tesseract` can process `stdin`... – Mark Setchell Aug 23 '16 at 21:51
  • That does seem to be promising, @MarkSetchell. I'm digging into that now, figuring out how that would work. – mlissner Aug 23 '16 at 22:03
  • Looks like this feature was added in 3.04, so I've upgraded to that, and it seems to be working. – mlissner Aug 25 '16 at 19:33
  • Hi @mlissner, have you figured it out? – Anmol Deep Apr 11 '23 at 06:32
  • No, but we made [a microservice](https://free.law/projects/doctor) that you can scale horizontally. If you wanted to be really clever, you could deploy the microservice to a machine where /tmp is mapped to a memory filesystem. – mlissner Apr 11 '23 at 22:48
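
For anyone landing here, the `stdin` approach mentioned in the comments can be sketched roughly as follows. This is a minimal illustration, not the pipeline the asker actually used: it assumes a Tesseract build that accepts `stdin` as the input name and `stdout` as the output base (the comments above suggest 3.04 or later), and the helper name `ocr_bytes` and its `cmd` parameter are made up for this example.

```python
import subprocess


def ocr_bytes(image_bytes, cmd=("tesseract", "stdin", "stdout")):
    """Pipe an in-memory image to an OCR command and return its output text.

    `cmd` defaults to reading the image from stdin and writing the
    recognized text to stdout, so no temporary files are involved.
    The name and signature of this helper are hypothetical.
    """
    result = subprocess.run(
        list(cmd),
        input=image_bytes,       # the PNG blob produced in memory
        stdout=subprocess.PIPE,  # capture the recognized text
        stderr=subprocess.PIPE,  # keep Tesseract's diagnostics off the console
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_bytes(png_blob)` on each page's PNG bytes would then replace the write-then-read step. Whether this is actually faster than a memory-backed `/tmp` is something to measure on your own workload, since each call still pays the cost of starting a new process.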
