
I am using PDFBox to extract images and text from this PDF. I have the following code for extracting the text:

 // document is an already loaded PDDocument
 PDFTextStripper p = new PDFTextStripper();
 String thistext = p.getText(document);

This extracts the text properly. However, when I try to extract images from the same PDF using the ExtractImages class, the images produced are whole pages of the PDF, not the individual images. Is that because the PDF might be a scanned copy? If so, how is the text being extracted?

rivu

1 Answer


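I believe the fact that it is scanned is your issue. Scanned PDFs can still have detectable (and highlightable) text, typically from an OCR text layer stored alongside the scan, but each page is still a single image. That would explain why the text extracts while ExtractImages returns whole pages. To test this hypothesis, I would try a known-good PDF such as this one.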
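If you want to check the hypothesis programmatically, here is a rough sketch of one possible heuristic: treat a page as "probably scanned" when it holds exactly one image and that image covers most of the page. This is plain Java and not part of PDFBox. The class `ScannedPageCheck`, the `DrawnImage` holder, and the 90% threshold are all made up for illustration, and you would still have to obtain the page size and the drawn image sizes from your PDF library yourself.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): guesses whether a page is a scan. */
class ScannedPageCheck {

    /** Size of one image as drawn on the page, in PDF points (assumed input). */
    static final class DrawnImage {
        final double width;
        final double height;

        DrawnImage(double width, double height) {
            this.width = width;
            this.height = height;
        }
    }

    // Assumed threshold: an image covering at least 90% of the page counts as full page.
    static final double FULL_PAGE_RATIO = 0.90;

    /** Returns true when the page has exactly one image and it covers most of the page. */
    static boolean looksScanned(double pageWidth, double pageHeight, List<DrawnImage> images) {
        if (images.size() != 1 || pageWidth <= 0 || pageHeight <= 0) {
            return false;
        }
        DrawnImage img = images.get(0);
        double coverage = (img.width * img.height) / (pageWidth * pageHeight);
        return coverage >= FULL_PAGE_RATIO;
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 points) with a single image drawn over the whole page.
        System.out.println(looksScanned(612, 792, Arrays.asList(new DrawnImage(612, 792)))); // prints true
        // Same page size with one small embedded figure.
        System.out.println(looksScanned(612, 792, Arrays.asList(new DrawnImage(200, 150)))); // prints false
    }
}
```

If every page of your PDF comes back as "scanned" under a check like this, the extracted text is most likely coming from a separate text layer and not from the images themselves.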

supersam654
  • Thanks for the prompt reply. Yes, I have tested with other PDFs, and it worked for them. I was confused about text recognition in scanned documents. – rivu Jan 31 '13 at 03:01