I asked a similar question before on Stack Overflow. I now have a related follow-up question, so I am restating the original problem here.
I was using PDFBox to extract images and text from a PDF, available on SkyDrive and Scribd. I had the following code for extracting the text:
// document is the already loaded PDDocument
PDFTextStripper p = new PDFTextStripper();
String thistext = p.getText(document);
This extracted the text properly. However, when I tried to extract images from the same PDF using the ExtractImages class, the images produced were all the pages of the PDF, not the actual embedded image (there should be only 1).
It appeared to me that the PDF could be a scanned document. The answer to my earlier question said that the fact that it is scanned is the issue. I tried once more with pdftotext and pdfimages. The text is extracted, but pdfimages outputs 5 image files, which are all pages of the PDF (same as PDFBox).
As far as I know, raster images are stored as XObjects in a PDF. When I opened the PDF with a text editor, I saw 5 occurrences of the following line:
<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799
This is probably why PDFBox and Xpdf output the 5 pages of the PDF as image files. But then how is the text being extracted from the PDF? Is there technical documentation that explains why (or how) text can be extracted from such a document, where the pages are "supposedly" embedded as XObjects? I would like to cite that documentation in my report.
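The text-editor check described above can be repeated programmatically. The following is only a minimal sketch using the Java standard library, not PDFBox; the class name PdfRawScan, the count helper and both patterns are hypothetical names introduced for illustration. It counts image XObject dictionaries and font dictionaries in the raw bytes of the file. If font dictionaries turn up next to the page images, that would suggest (as an assumption about this particular file, not a confirmed fact) that the pages also contain text drawn with those fonts, which is what text extractors read. Note that dictionaries stored inside compressed object streams are not visible to a raw scan like this, so the counts can be too low.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfRawScan {
    // Image XObject dictionaries, e.g. "/Subtype /Image" (whitespace between the names is optional).
    static final Pattern IMAGE = Pattern.compile("/Subtype\\s*/Image\\b");
    // Font dictionaries; the trailing \b keeps "/Type /FontDescriptor" from matching.
    static final Pattern FONT = Pattern.compile("/Type\\s*/Font\\b");

    // Counts how often the pattern occurs in the raw (undecoded) bytes of a PDF.
    static int count(Pattern pattern, byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary stream data cannot break decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher m = pattern.matcher(raw);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java PdfRawScan <file.pdf>");
            return;
        }
        byte[] pdfBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("image XObjects:    " + count(IMAGE, pdfBytes));
        System.out.println("font dictionaries: " + count(FONT, pdfBytes));
    }
}
```

Running it as `java PdfRawScan scanned.pdf` (the file name is a placeholder) prints both counts; for the file in question the first one would be expected to match the 5 lines seen in the text editor.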