Using PDFBox, why can text be extracted but not images?

Question

I am using PDFBox to extract images and text from this PDF. I have the following code for extracting the text:

    PDFTextStripper p = new PDFTextStripper();
    String thistext = p.getText(document);

This extracts the text properly. However, when I try to extract images from the same PDF using the ExtractImages class, the images produced are whole pages of the PDF, not the actual images. Is that because the PDF might be a scanned copy? If so, how can the text be extracted?

score 1 · Answer 1 · answered Jan 31 '13 at 02:59

I believe the fact that it is scanned is your issue. A scanned PDF can still contain detectable (and highlightable) text, typically added by OCR as a layer over the scan, but each page is still a single image. That is why the images you get are whole pages. To test this hypothesis, I would try using a known good PDF such as this one.
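One rough way to check the hypothesis on a given file is to look at each page's images: a scanned page usually has exactly one image that covers nearly the whole page. The sketch below shows only that heuristic in plain Java. The class `ScanHeuristic`, the `DrawnImage` type, and the 0.9 coverage threshold are hypothetical names and values chosen for illustration, not part of PDFBox; the page and image sizes are assumed to have been collected separately from the PDF.

```java
import java.util.List;

// Hypothetical helper, not a PDFBox class: guesses whether a page is a scan
// from the sizes of the images drawn on it.
final class ScanHeuristic {

    // Size of one image as drawn on the page, in the same units as the page size.
    static final class DrawnImage {
        final double width;
        final double height;

        DrawnImage(double width, double height) {
            this.width = width;
            this.height = height;
        }
    }

    // Assumed threshold: the image must cover at least 90% of the page area.
    static final double MIN_COVERAGE = 0.9;

    // Returns true when the page holds exactly one image and that image
    // covers at least MIN_COVERAGE of the page area.
    static boolean looksScanned(double pageWidth, double pageHeight, List<DrawnImage> images) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0 || images.size() != 1) {
            return false;
        }
        DrawnImage image = images.get(0);
        double coverage = (image.width * image.height) / pageArea;
        return coverage >= MIN_COVERAGE;
    }

    public static void main(String[] args) {
        // An A4-sized page (595 x 842 points) with a single full-page image.
        List<DrawnImage> fullPage = List.of(new DrawnImage(595, 842));
        System.out.println(looksScanned(595, 842, fullPage));
    }
}
```

If every page of the document passes this check, the "images" are most likely the page scans themselves, which would match the output you are seeing.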

supersam654

Thanks for the prompt reply. Yes, I tested with other PDFs and it worked for them. I was confused about text recognition in scanned documents. – rivu Jan 31 '13 at 03:01
