Count Images in a pdf document through python

Question

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

Using pdfimages from poppler-utils

You might want to take a look at pdfimages from the poppler-utils package.

I have taken the sample pdf from - Sample PDF

On running the following command, images present in the pdf are extracted -

pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image

Some of the images extracted from this brochure are -

[Image: Extracted Image 1]

[Image: Extracted Image 2]

So you can use Python's subprocess module to run this command, extract all the images, and then count the files it produces.

Note: There are some drawbacks to this method. By default it writes images in PPM (or PBM) format, not JPEG; options such as -j, -png, or -all change the output format. Also, some additional items, such as masks, might be extracted even though they are not separate images in the PDF.
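If you only need the count, a minimal sketch is to call pdfimages with its -list option, which prints one table row per embedded image instead of writing files, and count the rows. The function names below are made up for illustration, and the parsing assumes the usual layout of that table (a header line, a dashed rule, then rows whose third column is the type), so check it against the output of your poppler version.

```python
import subprocess
from typing import Optional


def count_listed_images(listing: str, only_type: Optional[str] = "image") -> int:
    """Count rows in the table printed by `pdfimages -list`.

    Assumed layout: a column-header line, a dashed rule, then one row per
    embedded image whose third column is the type (for example "image",
    "mask", "smask" or "stencil"). Pass only_type=None to count every row.
    """
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # Data rows begin with a numeric page number; this skips the
        # header line, the dashed rule and blank lines.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        if only_type is None or fields[2] == only_type:
            count += 1
    return count


def count_images_with_pdfimages(pdf_path: str) -> int:
    """Run `pdfimages -list` (poppler-utils must be installed) and count rows."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_listed_images(result.stdout)
```

Calling count_images_with_pdfimages("/home/tata/Desktop/4555c-5055cBrochure.pdf") would then return the number of rows of type "image". Filtering on the type column is one way to leave out the masks mentioned in the note above.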

Using pdfminer

If you want to do this using pdfminer, take a look at this blog post - Extracting Text & Images from PDF Files

Pdfminer allows you to traverse through the layout of a particular pdf page. The following image shows the layout objects as well as the tree structure generated by pdfminer -

Layout Objects and Tree Structure

Image Source - Pdfminer Docs

Thus, walking the LTFigure objects, and the LTImage objects they contain, can help you extract or count the images in the PDF document.
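A minimal sketch of that traversal, assuming the pdfminer.six fork is installed (it provides extract_pages in pdfminer.high_level). The helper names are made up for illustration. The walker recurses into every iterable container because images can sit inside nested figures.

```python
from typing import Iterable


def count_layout_images(objects: Iterable, image_cls: type) -> int:
    """Recursively count instances of image_cls in a layout tree."""
    total = 0
    for obj in objects:
        if isinstance(obj, image_cls):
            total += 1
        elif hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes)):
            # Containers such as LTFigure are iterable and may nest further.
            total += count_layout_images(obj, image_cls)
    return total


def count_images_with_pdfminer(pdf_path: str) -> int:
    """Count LTImage objects on every page (assumes pdfminer.six is installed)."""
    # Imported here so the tree walker above can be used without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return sum(
        count_layout_images(page, LTImage) for page in extract_pages(pdf_path)
    )
```

This counts image objects as pdfminer sees them, so an image drawn on several pages is counted once per page, and vector drawings are not counted at all.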

Note: Neither of these methods is guaranteed to be accurate; their accuracy depends heavily on the type of PDF document you are dealing with.

Wow, that was really helpful. I studied the layout part of PDFMiner a bit, but things didn't work for me; I am getting the syntax wrong. Can you please help me with some code? – Hayat, Nov 02 '17 at 10:37
@hayat If you could point out exactly where you are facing problems, I might be able to help. Maybe this can be of some help: https://dadruid5.com/2014/08/14/getting-started-extracting-tables-with-pdfminer/ I would advise you to first fiddle around with the code a little to get a better idea of the layout objects. – Ganesh Tata, Nov 02 '17 at 12:47

score 0 · Answer 2 · answered Nov 02 '17 at 04:32

I don't think this can be done directly, although I have done something similar using the following approach:

Use ghostscript to convert the PDF to page images.
On each page, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
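
The first step can be sketched as follows; this is an illustration under assumptions, not the code used in this answer. The function names, output pattern, and resolution are made up, and the executable is assumed to be called gs (on Windows it is typically gswin64c). The OpenCV step is left out because the detection logic depends heavily on what the pages look like.

```python
import subprocess
from typing import List


def ghostscript_render_command(
    pdf_path: str, out_pattern: str = "page-%03d.png", dpi: int = 150
) -> List[str]:
    """Build a Ghostscript command that renders each page to a PNG file."""
    return [
        "gs",                           # assumed executable name
        "-dNOPAUSE",                    # do not pause between pages
        "-dBATCH",                      # exit when the input is processed
        "-dSAFER",                      # restrict file access
        "-sDEVICE=png16m",              # 24-bit colour PNG output
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]


def render_pages(pdf_path: str, out_pattern: str = "page-%03d.png", dpi: int = 150) -> None:
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(ghostscript_render_command(pdf_path, out_pattern, dpi), check=True)
```

The resulting page images can then be loaded one by one for the computer vision step.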

Amarpreet Singh

Thanks. Is there a way to extract all the images from a PDF in Python? – Hayat Nov 02 '17 at 05:34
This can be done in Python. Try to implement the above approach. Use ghostscript to convert the PDF to images. – Amarpreet Singh Nov 02 '17 at 05:43
Thanks. Do you have a reference to any code I can look into? – Hayat Nov 02 '17 at 05:52
@Hayat [PDFFigures2](https://github.com/allenai/pdffigures2) by [Allen AI](http://allenai.org) can be used to extract images from PDFs. It is implemented in Scala, but an earlier version [PDFFigures](https://github.com/allenai/pdffigures) is implemented in python. Check it out! – Samyak Jain Nov 21 '17 at 07:03
@SamyakJain Thank You. Can you tell me where I can find the tutorial or the study material? – Hayat Nov 21 '17 at 10:25
@Hayat Tutorial/Study Material for what? PDFFigures(2) is an implementation of this [paper](http://ai2-website.s3.amazonaws.com/publications/pdf2.0.pdf) – Samyak Jain Nov 21 '17 at 13:22
I saw that paper. I mean, how do I use it with Python? The GitHub repository contains C++ files, and I don't know the syntax to use it. Have you ever tried it? – Hayat Nov 22 '17 at 05:21
