Why can text be extracted from a scanned document, but not the images?

Question

I asked a similar question on Stack Overflow before. I want to ask a related follow-up, so I am restating the original question here.

I was using PDFBox to extract images and text from a PDF, which is available on SkyDrive and Scribd. I used the following code to extract the text:

 // 'document' is the PDDocument that was loaded beforehand
 PDFTextStripper p = new PDFTextStripper();
 String thistext = p.getText(document);

This extracted the text properly. However, when I tried to extract images from the same PDF using the ExtractImages class, the images produced were all the pages of the PDF, not the actual image (there should be 1).

It appeared to me that the PDF could be a scanned document. The answer to my earlier question said that the fact that it is scanned is the issue. I tried once more with pdftotext and pdfimages. The text is extracted, but pdfimages outputs 5 image files, which are all pages of the PDF (the same as with PDFBox).

As far as I know, raster images are stored as XObjects in a PDF. When I opened the PDF in a text editor, I saw 5 occurrences of the following line:

<< /Type /XObject /Subtype /Image /Name /X /Width 2600 /Height 3799

This is probably why PDFBox and Xpdf output the 5 pages of the PDF as image files. But then how is the text being extracted from the PDF? Is there technical documentation that explains why (or how) text can be extracted from such a document, where the pages are "supposedly" embedded as XObjects? I would like to cite it in my report.

Unfortunately your PDF reference does require some kind of login. — mkl, Feb 12 '13 at 23:05
@mkl, thanks for looking into it. I posted the PDF on SkyDrive and made it open to everyone. Is there any other file hosting service you would recommend? I can use that. — rivu, Feb 12 '13 at 23:07
Hhmmm, i just tested it from my phone and I could download it. Did I simply overlook the download without log-in in my regular browser? Well, I'll be looking into that tomorrow. — mkl, Feb 12 '13 at 23:14
ok, i added a scribd link. please see if you can download it. — rivu, Feb 12 '13 at 23:17
As a first guess, though: the PDF format allows you to have a single (e.g. scanned) image visible and additionally (e.g. OCR'ed) text behind the letters in the image for copy & paste and other kinds of text extraction. The visible characters in your document look like they are part of an image. In that case the PDF simply contains both: an image of the whole page and text data beneath it to extract. — mkl, Feb 12 '13 at 23:22
Hmm. I get your idea. Is there a technical documentation which mentions this? Because I need to cite something to validate your thought. Thanks for helping though. — rivu, Feb 12 '13 at 23:27
The technical documentation would be the ISO standard defining PDF, i.e. ISO 32000-1:2008, "soon" to be updated to ISO 32000-2... When you read it, you'll see that nothing keeps you from first drawing text and then putting an image above it, or even first putting an image there and then drawing invisible text above it. — mkl, Feb 12 '13 at 23:34
@mkl, thanks for your help. You seem to know a lot about PDFs. I had another question here (http://stackoverflow.com/questions/14846560/can-pdfbox-extract-vector-images). It would be of great help if you could please look into that. — rivu, Feb 13 '13 at 05:30
I do know the format PDF somewhat but I have not yet worked with PDFbox. I'm afraid, therefore, I cannot help with that other question. — mkl, Feb 13 '13 at 06:19

mkl · Accepted Answer · 2013-02-15T19:11:31.037

Having inspected your PDF file, I can confirm the first guess from the comments on your question.

Your sample document is scanned and essentially consists of one bitmap image per page. When you zoom into the document, you can quickly see that all content looks fairly pixel'ish.

All the images have a resolution of 2600x3799 and are black and white.

These images have furthermore been OCR'ed and the resulting text has been invisibly added to the pages which allows for selecting, copying & pasting.

E.g. have a look at the top of page 885:

[Image: top of page 885]

Its content stream starts like this:

1 0 0 1 -0.5998 -0.4801 cm
1 1 1 rg
1 i 
/RelativeColorimetric ri
/GS0 gs
0 0 469.2 684.7 re
f
q
467.9972 0 0 683.8015 0.6014 0.4492 cm
/Im0 Do
Q

Here /Im0, the page image, is inserted.
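As a side calculation, the placement matrix and the pixel dimensions together give the scan resolution. An image XObject is painted into a unit square, so the scale factors of the cm matrix in front of /Im0 Do are the painted size in user-space units. The following is only a sketch: the class and method names are made up for illustration, and it assumes the default user unit of 1/72 inch and no other scaling in effect (the cm at the start of the stream is a pure translation).

```java
// Hypothetical helper, not part of PDFBox or any other library.
final class ScanResolution {

    /** Resolution in dpi of an image side with the given pixel count painted across the given length in points. */
    static double dpi(int pixels, double points) {
        // Assumption: default user space, i.e. 72 units per inch.
        return pixels / (points / 72.0);
    }

    public static void main(String[] args) {
        // Scale factors taken from "467.9972 0 0 683.8015 0.6014 0.4492 cm".
        double widthPt = 467.9972;
        double heightPt = 683.8015;
        // Pixel dimensions taken from the image dictionary (/Width 2600 /Height 3799).
        int widthPx = 2600;
        int heightPx = 3799;

        System.out.println("horizontal dpi: " + dpi(widthPx, widthPt));
        System.out.println("vertical dpi:   " + dpi(heightPx, heightPt));
    }
}
```

Both values come out at roughly 400 dpi, which is consistent with each page being a single full-page scan.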

1 0 0 1 0.5998 0.4801 cm
0 0 0 rg
BT
/TT0 1 Tf
3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm

Here the addition of text is prepared; note especially 3 Tr: this operation sets the text rendering mode to 3, which is "Neither fill nor stroke text (invisible)" (section 9.3.6, Text Rendering Mode, in ISO 32000-1:2008).
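If you want to check other pages for the same pattern, a rough scan for Tr operators in an already decompressed content stream could look like the sketch below. It is a naive whitespace tokenizer written for illustration (the class and method names are hypothetical), not a real content stream parser: it assumes the stream is available as plain text and it would be fooled by a string literal that happens to contain " Tr".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or any other library.
final class RenderModeScan {

    /** Returns the integer operand of every Tr operator in the given content stream text, in order. */
    static List<Integer> renderModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                try {
                    // The operand precedes the operator in PDF's postfix notation.
                    modes.add(Integer.parseInt(tokens[i - 1]));
                } catch (NumberFormatException e) {
                    // Not a plain integer operand; skipped in this simplified sketch.
                }
            }
        }
        return modes;
    }

    public static void main(String[] args) {
        String excerpt = "BT\n/TT0 1 Tf\n3 Tr 9.8 0 0 10.4 35.8002 640.4199 Tm\n";
        List<Integer> modes = renderModes(excerpt);
        // Mode 3 means "neither fill nor stroke", i.e. invisible text.
        System.out.println("render modes: " + modes + ", invisible text: " + modes.contains(3));
    }
}
```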

(A )Tj
/TT1 1 Tf
-0.01 Tc 8.8 0 0 9.5 43.4002 640.4199 Tm
(%gust )Tj

Here you see text added, starting with an 'A ' and an '%gust '. This actually shows that the result of the OCR'ing does not seem to have been properly checked as that should have been 'August'. The low quality text information continues:
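To illustrate why a text extractor still sees these strings even though nothing is painted, here is a simplified sketch that collects the operands of the Tj operators from such an excerpt. It is not how PDFBox or pdftotext are implemented; the class and method names are hypothetical, and the sketch assumes literal strings without escaped or nested parentheses and a font encoding in which the string bytes can be read directly as characters. A real extractor maps the character codes through the font's encoding instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or any other library.
final class ShowTextScan {

    // A literal string "( ... )" followed by the Tj operator.
    // Assumption: no parentheses or backslash escapes inside the string.
    private static final Pattern TJ = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    /** Returns the string operand of every simple Tj operation, in order of appearance. */
    static List<String> shownStrings(String contentStream) {
        List<String> strings = new ArrayList<>();
        Matcher m = TJ.matcher(contentStream);
        while (m.find()) {
            strings.add(m.group(1));
        }
        return strings;
    }

    public static void main(String[] args) {
        String excerpt = "(A )Tj\n/TT1 1 Tf\n-0.01 Tc 8.8 0 0 9.5 43.4002 640.4199 Tm\n(%gust )Tj\n";
        // The rendering mode plays no role here: the strings are in the stream either way.
        System.out.println(String.join("", shownStrings(excerpt)));
    }
}
```

The rendering mode only controls painting, so the strings are found regardless of 3 Tr.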

A %gust , 1978 SHORT PAPERS 885
where
and also
Similarly for B. Also,
T, = AY-l T
as a result of the adiabatic cooling of the vapour.
Stage 2:
Here a volume of vapour and a volume of liquid I are removed and replaced with an
equal volume of air containing concentrations Y and s of A and B, respectively. Of course,
r or s may either or both be negligibly small, with subsequent simplification.

As you can see, many special characters and formulas have either not been recognized at all or been recognized incorrectly.

Thanks for the detailed answer. I sort of suspected that, but I didn't know about `3 Tr` command. I can now cite this. — rivu, Feb 15 '13 at 18:06
