Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?
2 Answers
- Using pdfimages from poppler-utils
You might want to take a look at pdfimages from the poppler-utils package.
I have taken the sample pdf from - Sample PDF
On running the following command, images present in the pdf are extracted -
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
Some of the images extracted from this brochure are -
So, you can use python's subprocess module to execute this command, and then extract all the images.
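Note: This method has some drawbacks. By default, pdfimages writes images in PPM (or PBM) format rather than JPEG; passing the -j option saves JPEG-encoded images as .jpg files instead. It may also extract additional items (for example, masks) that are not actually visible images in the PDF.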
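As a rough sketch of the subprocess idea, the snippet below counts images without extracting them by calling `pdfimages -list`, which prints one table row per embedded image after two header lines. It assumes a poppler-utils version that supports `-list` is on the PATH. The function names, and the choice to count only rows whose type column is `image`, are illustrative assumptions rather than part of the answer above.

```python
import subprocess


def count_image_rows(listing: str) -> int:
    """Count rows of type 'image' in the text printed by `pdfimages -list`."""
    rows = listing.splitlines()[2:]  # skip the column-name line and the dashed rule
    count = 0
    for row in rows:
        fields = row.split()
        # Columns begin with: page, num, type, ...
        # Masks show up as 'mask', 'smask' or 'stencil'; skipping them is an
        # assumption made for this sketch.
        if len(fields) >= 3 and fields[2] == "image":
            count += 1
    return count


def count_pdf_images(pdf_path: str) -> int:
    """Run `pdfimages -list` on pdf_path and count the listed images."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_image_rows(result.stdout)


# Example call (the path is a placeholder):
# print(count_pdf_images("brochure.pdf"))
```

Keeping the parsing separate from the subprocess call makes it easy to check the counting logic against saved `pdfimages -list` output.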
- Using pdfminer
If you want to do this using pdfminer, take a look at this blog post - Extracting Text & Images from PDF Files
Pdfminer allows you to traverse through the layout of a particular pdf page. The following image shows the layout objects as well as the tree structure generated by pdfminer -
[Image: Layout Objects and Tree Structure. Image source: Pdfminer Docs]
Thus, collecting the LTFigure objects, and the LTImage objects nested inside them, can help you extract or count the images in the PDF document.
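A minimal sketch of that traversal, assuming the maintained pdfminer.six fork (its `extract_pages` helper and `LTImage` class). The helper names `count_nodes` and `count_images_in_pdf` are made up for illustration. It walks each page's layout tree and counts `LTImage` leaves, which typically sit inside `LTFigure` containers.

```python
def count_nodes(node, leaf_type) -> int:
    """Recursively count instances of leaf_type in a layout tree.

    Container objects (pages, figures, text boxes) are iterable and yield
    their children; leaf objects are not iterable.
    """
    if isinstance(node, leaf_type):
        return 1
    try:
        children = iter(node)
    except TypeError:
        return 0  # a leaf of some other type, e.g. a character or a line
    return sum(count_nodes(child, leaf_type) for child in children)


def count_images_in_pdf(pdf_path: str) -> int:
    """Count LTImage objects on all pages of the PDF at pdf_path."""
    # Imported here so count_nodes can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return sum(count_nodes(page, LTImage) for page in extract_pages(pdf_path))


# Example call (the path is a placeholder):
# print(count_images_in_pdf("brochure.pdf"))
```

Vector drawings and backgrounds are not reported as `LTImage`, so the result may differ from what a reader would visually count as images.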
Note: Neither of these methods is guaranteed to be accurate; their accuracy depends heavily on the type of PDF document you are dealing with.

-
Wow, that was really helpful. I studied the layout part of PDFMiner a bit, but things didn't work for me; I am getting the syntax wrong. Can you please help me with some code? – Hayat Nov 02 '17 at 10:37
-
@hayat If you could point out exactly where you are facing problems, I might be able to help. Maybe this can be of some help: https://dadruid5.com/2014/08/14/getting-started-extracting-tables-with-pdfminer/ I would advise you to first fiddle around with the code a little, so as to get a better idea of the layout objects. – Ganesh Tata Nov 02 '17 at 12:47
I don't think this can be done directly. However, I have done something similar using the following approach:
- Use Ghostscript to convert the PDF into page images.
- On each page image, use computer vision (OpenCV) to extract the areas of interest (in your case, the images).
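The two steps above could be sketched roughly as follows. This assumes the Ghostscript executable is called `gs` (on Windows it is usually named differently) and that OpenCV's Python bindings are installed. The region detector is a simple heuristic chosen for illustration, not necessarily the method used in this answer: it treats any large non-white connected blob as a candidate image, and the threshold values and function names are assumptions.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Build a Ghostscript command that renders every page to a PNG file."""
    return [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=png16m",              # 24-bit colour PNG output
        f"-r{dpi}",                     # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern="page-%03d.png", dpi=150):
    """Step 1: convert the PDF into one PNG per page."""
    subprocess.run(ghostscript_command(pdf_path, out_pattern, dpi), check=True)


def find_large_regions(page_png, min_area_fraction=0.02):
    """Step 2: return (x, y, w, h) boxes of large non-white blobs on a page."""
    import cv2  # imported lazily so step 1 works without OpenCV

    page = cv2.imread(page_png)
    if page is None:
        raise FileNotFoundError(page_png)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Mark every pixel that is not (almost) white.
    _, mask = cv2.threshold(gray, 245, 255, cv2.THRESH_BINARY_INV)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    min_area = min_area_fraction * gray.shape[0] * gray.shape[1]
    boxes = [cv2.boundingRect(contour) for contour in contours]
    # Individual glyphs form small blobs and are filtered out by the area test.
    return [(x, y, w, h) for (x, y, w, h) in boxes if w * h >= min_area]


# Example calls (the paths are placeholders):
# render_pages("input.pdf")
# print(len(find_large_regions("page-001.png")))
```

Expect false positives (for example, filled tables or coloured banners) and misses (images on a non-white background), so the thresholds would need tuning per document type.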

-
Thanks. Is there a way to extract all the images from a PDF in Python? – Hayat Nov 02 '17 at 05:34
-
This can be done in Python. Try implementing the above approach: use Ghostscript to convert the PDF to images. – Amarpreet Singh Nov 02 '17 at 05:43
-
@Hayat [PDFFigures2](https://github.com/allenai/pdffigures2) by [Allen AI](http://allenai.org) can be used to extract images from PDFs. It is implemented in Scala, but an earlier version [PDFFigures](https://github.com/allenai/pdffigures) is implemented in python. Check it out! – Samyak Jain Nov 21 '17 at 07:03
-
@SamyakJain Thank you. Can you tell me where I can find a tutorial or study material? – Hayat Nov 21 '17 at 10:25
-
@Hayat Tutorial/Study Material for what? PDFFigures(2) is an implementation of this [paper](http://ai2-website.s3.amazonaws.com/publications/pdf2.0.pdf) – Samyak Jain Nov 21 '17 at 13:22
-
I saw that paper. I mean, how do I use it with Python? The GitHub repository contains C++ files, and I don't know the syntax to use it. Have you ever tried it? – Hayat Nov 22 '17 at 05:21