How to extract embedded OCR data from a PDF?

Question

I have PDF files with embedded OCR data (they have already been OCRed), so they are searchable. Now I want to extract this OCR data so that I can feed it into my Tomcat 6 search server, and for that I need the plain OCR text. Is it possible to extract the embedded OCR data from the PDF files? Output with coordinates would be ideal, but plain-text files would also be sufficient.

I don't need a specific language. Ideally I could use it from a batch script, so a command-line tool would be nice. By the way, I want to use it on Windows. — erik, Mar 03 '11 at 07:05

score 0 · Answer 1 · answered Mar 02 '11 at 17:04

You should be able to do this with iText or iTextSharp. iTextSharp has essentially no documentation, however, and a good number of its functions are not equivalent to those found in iText.
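A rough sketch of what this could look like in Java, assuming iText 5.x, where `PdfTextExtractor.getTextFromPage(reader, page)` returns the text of one page. The `OcrTextDump` class, its `dump` helper, and the file name `scan.pdf` are made-up names for illustration. The iText wiring is shown only in comments so that the snippet compiles without the library on the classpath.

```java
import java.util.function.IntFunction;

// Hypothetical helper (not part of iText): concatenates per-page text.
class OcrTextDump {

    // Returns the text of pages 1..pageCount, separated by form feeds.
    // pageText maps a 1-based page number to that page's text.
    static String dump(int pageCount, IntFunction<String> pageText) {
        StringBuilder out = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            if (page > 1) {
                out.append('\f'); // page separator
            }
            out.append(pageText.apply(page));
        }
        return out.toString();
    }

    // Assumed wiring with iText 5.x (needs the iText jar on the classpath):
    //
    //   PdfReader reader = new PdfReader("scan.pdf");      // com.itextpdf.text.pdf
    //   String text = dump(reader.getNumberOfPages(), page -> {
    //       try {
    //           // com.itextpdf.text.pdf.parser.PdfTextExtractor
    //           return PdfTextExtractor.getTextFromPage(reader, page);
    //       } catch (java.io.IOException e) {
    //           throw new java.io.UncheckedIOException(e);
    //       }
    //   });
    //   reader.close();
}
```

Wrapped in a small `main` that writes the result to a `.txt` file, this could be called from a batch script once per PDF. For coordinates, iText 5's parser package has a `RenderListener` interface whose `renderText(TextRenderInfo)` callback receives each text chunk. That part is not shown here.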

PDFsharp does not support iref streams (cross-reference streams). These three are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they also have some pretty extensive PDF libraries.
