I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images it fails to extract.

For extracting the images I am using the code from this question: "Not able to extract images from PDFA1-a format document"

You can download a sample PDF with this problem from this link: http://myslams.com/test/2.pdf

Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?

sameer singh
  • *For extracting the image I am using following code* - do you mean your code from that other question or the code adapted according to @Tilman's answer? – mkl Jan 26 '15 at 04:58
  • @mkl code adapted according to tilman – sameer singh Jan 26 '15 at 05:45
  • I get *The requested URL /test/2.pdf was not found on this server.* Just like there is no PDF on the URL you mention, there may not be an image on the place you mention. Maybe you are trying to extract a Form XObject assuming that it's an Image XObject. What may be perceived as an image to the human eye, may actually be a bunch of lines and shapes instead of an actual image. – Bruno Lowagie Jan 26 '15 at 07:26
  • I also get a 404, *The requested URL /test/2.pdf was not found on this server.* @sameersingh Please check your links – mkl Jan 26 '15 at 08:57
  • I had to remove the PDFs because of confidentiality. Sorry for that. I will be putting in sample PDFs tomorrow. – sameer singh Jan 26 '15 at 17:01

1 Answer

As the OP has not yet replaced his stale sample PDF link with a working one, the question can only be answered in general terms.

The code referenced by the OP (with the corrections in the answer of @Tilman) iterates the immediate image resources of each page and stores the respective files.

Thus, the code may store too many images because the image resources of a page are not necessarily used on that page:

  1. On the one hand, an image resource may not be used anywhere in the file, or at least nowhere visible; it may merely be a left-over from some prior PDF editing session.
  2. On the other hand, multiple pages may share a resources dictionary containing all the images of all these pages; in this case the OP's code exports many duplicates.
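
The duplicate case can be illustrated with a small, library-independent sketch. The classes `Image` and `Page` below are hypothetical stand-ins, not PDFBox classes; the sketch only assumes that a shared image shows up as the very same object on every page that references it, so that tracking object identity is enough to export each image once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; these are not PDFBox classes.
final class SharedResourceDedup {
    /** Stand-in for an image XObject; object identity plays the role of the PDF object reference. */
    static final class Image {
        final String name;
        Image(String name) { this.name = name; }
    }

    /** Stand-in for a page: only the images listed in its resources dictionary. */
    static final class Page {
        final List<Image> resourceImages;
        Page(List<Image> resourceImages) { this.resourceImages = resourceImages; }
    }

    /** Returns each distinct image object once, in first-seen order. */
    static List<Image> uniqueImages(List<Page> pages) {
        // Identity-based set: two pages pointing at the same image object count once.
        Set<Image> seen = Collections.newSetFromMap(new IdentityHashMap<Image, Boolean>());
        List<Image> result = new ArrayList<Image>();
        for (Page page : pages) {
            for (Image image : page.resourceImages) {
                if (seen.add(image)) {
                    result.add(image);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Image logo = new Image("logo");
        Image photo = new Image("photo");
        // Three pages sharing one resources dictionary with two images.
        List<Image> shared = Arrays.asList(logo, photo);
        List<Page> pages = Arrays.asList(new Page(shared), new Page(shared), new Page(shared));
        for (Image image : uniqueImages(pages)) {
            System.out.println(image.name);
        }
    }
}
```

Note that this only removes duplicates; it does not tell you whether an image resource is actually drawn on any page, which would require looking at the content streams.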

And the code may store too few images because there are other places where images may be put:

  1. Image data may be directly included in the page content stream, aka inline images.
  2. Constructs with their own resources (form XObjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.
  3. Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.
  4. XFA forms may provide their own images, too.
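
The first three cases all boil down to the same idea: anything with its own content stream and resources has to be visited recursively, not just the page. The following is a minimal, library-independent sketch of that traversal. The `Container` class and its fields are hypothetical and are not PDFBox API; with a real library, finding inline images additionally requires parsing the content stream, and XFA images are not covered by this model at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; these are not PDFBox classes.
final class NestedImageWalk {
    /**
     * Stand-in for anything with a content stream and resources:
     * a page, a form XObject, a pattern, a Type 3 glyph, or an annotation appearance.
     */
    static final class Container {
        final String label;
        final List<String> resourceImages = new ArrayList<String>(); // image XObjects in the resources
        final List<String> inlineImages = new ArrayList<String>();   // images embedded in the content stream
        final List<Container> children = new ArrayList<Container>(); // nested forms, patterns, appearances
        Container(String label) { this.label = label; }
    }

    /** Collects resource and inline images of the root and of everything reachable from it. */
    static List<String> collect(Container root) {
        List<String> found = new ArrayList<String>();
        Set<Container> visited = Collections.newSetFromMap(new IdentityHashMap<Container, Boolean>());
        walk(root, found, visited);
        return found;
    }

    private static void walk(Container c, List<String> found, Set<Container> visited) {
        if (!visited.add(c)) {
            return; // already handled: guards against shared or cyclic references
        }
        found.addAll(c.resourceImages);
        found.addAll(c.inlineImages);
        for (Container child : c.children) {
            walk(child, found, visited);
        }
    }

    public static void main(String[] args) {
        Container page = new Container("page");
        page.resourceImages.add("Im1");
        Container form = new Container("form XObject");
        form.inlineImages.add("inline-1");
        Container appearance = new Container("annotation appearance");
        appearance.resourceImages.add("Im2");
        page.children.add(form);
        page.children.add(appearance);
        System.out.println(collect(page)); // prints [Im1, inline-1, Im2]
    }
}
```

Code that only inspects the page's immediate resources would find just `Im1` in this example and miss the other two images.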

As soon as the OP provides a representative sample file, the type of images he misses can be determined and a specific solution may be outlined.

EDIT

According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". In particular, the pointer to example code appropriate for the PDFBox version 1.8.8 used by the OP seems to have been important.

Thus, wrong output of any kind may also result from software orchestration issues, e.g. from using example code that does not match the library version in use.

mkl
  • Your answer here http://stackoverflow.com/questions/28140311/pdfbox-and-itext-extracting-image-with-incorrect-dpi solved my problem. Can I edit your answer above and mark it as the answer? – sameer singh Jan 29 '15 at 14:23
  • @sameersingh I hope my edit correctly conveys the resolution reason. – mkl Jan 29 '15 at 15:04