
I am trying to extract images from PDFs using code I found on Stack Overflow. It works for some PDFs but not for all of them. The pattern I see is that nothing is extracted from PDFs with more than about 8-10 pages. I think I am missing something small here. Please help me figure it out. This is the code I am using, and here is the link to the PDF resources:

import glob

import PyPDF2
from PIL import Image

def ExtractImages(filename):
    print("\n---------------------------------------")
    print("This is the pdf processing",filename)

    fileObject = PyPDF2.PdfFileReader(open(filename, "rb"))
    print(fileObject)
    pages = fileObject.getNumPages()
    print("Total number of Pages is.....",pages)
    # Note: the loop starts at index 2, so pages 0 and 1 are never scanned.
    for i in range(2,pages):
        tempPage = fileObject.getPage(i)
        if '/XObject' in tempPage['/Resources']:
            xObject = tempPage['/Resources']['/XObject'].getObject()
            for obj in xObject:
                if xObject[obj]['/Subtype'] == '/Image':
                    size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                    data = xObject[obj].getData()
                    if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                        mode = "RGB"
                    else:
                        mode = "P"
                    if '/Filter' in xObject[obj]:
                        if xObject[obj]['/Filter'] == '/FlateDecode':

                            img = Image.frombytes(mode, size, data)
                            img.save(obj[1:] + ".png")
                        elif xObject[obj]['/Filter'] == '/DCTDecode':
                            img = open(obj[1:] + ".jpg", "wb")
                            img.write(data)
                            img.close()
                        elif xObject[obj]['/Filter'] == '/JPXDecode':
                            img = open(obj[1:] + ".jp2", "wb")
                            img.write(data)
                            img.close()
                        elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                            img = open(obj[1:] + ".tiff", "wb")
                            img.write(data)
                            img.close()
                    else:
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
        else:
            print("No XObject resources on page", i, "of", filename)

listOfFiles = glob.glob('./*.pdf')
for file in listOfFiles:
    ExtractImages(file)
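
One thing worth checking in the code above: XObject names such as `/Im0` are only unique within a single resource dictionary, so saving to `obj[1:] + ".png"` can overwrite an image from an earlier page or an earlier PDF that used the same name. Whether that explains the missing images here is an assumption, not a diagnosis. A minimal sketch of a collision-free naming scheme, where `image_filename` is a hypothetical helper that is not part of PyPDF2:

```python
import os


def image_filename(pdf_path, page_index, xobject_name, extension):
    """Build an output name that is unique per PDF, per page and per XObject.

    Hypothetical helper for illustration; the name format is an arbitrary choice.
    """
    # "./118641.pdf" -> "118641"
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    # "/Im0" -> "Im0"
    name = xobject_name.lstrip("/")
    return "{}_p{:03d}_{}{}".format(stem, page_index, name, extension)


print(image_filename("./118641.pdf", 8, "/Im0", ".png"))  # 118641_p008_Im0.png
```

In the loop above this would replace `obj[1:] + ".png"` with `image_filename(filename, i, obj, ".png")`, and likewise for the other extensions.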
  • [Don't post code and errors as images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) – tripleee Nov 22 '17 at 10:44
  • You just edited an answer into your question. Please undo the edit and post that as an answer and you can accept it. – Dan D. Dec 12 '17 at 12:46
  • @DanD. I edited the last question. Is my question clear? – Sarwar Hayatt Dec 12 '17 at 13:12
  • You are aware that your code only looks for bitmap images? Vector graphics won't show up. And you are aware that your code only looks at the page resources? Inline images, or images in the resources of form XObjects, patterns, etc., won't show up. With this in mind, which image from which of your example files is unexpectedly not extracted? – mkl Dec 12 '17 at 23:01
  • @mkl Thank you for walking me through the situation, but when I give the page number explicitly, it does extract the image from that page. E.g. with `tempPage = fileObject.getPage(8)` and the file `118641.pdf`, the image is extracted. – Sarwar Hayatt Dec 13 '17 at 09:32
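
mkl's point about images inside form XObjects can be sketched as a recursive walk over the resources. This is only a sketch under stated assumptions: `resolve` and `iter_images` are hypothetical helper names, the code assumes dictionary-like objects as returned by the legacy PyPDF2 API used in the question, and it still does not cover inline images or pattern resources. The demonstration uses plain dicts so it runs without a PDF.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers a getObject() method
    (as objects in the legacy PyPDF2 API do); otherwise return it unchanged."""
    get = getattr(obj, "getObject", None)
    return get() if callable(get) else obj


def iter_images(resources, prefix="", _seen=None):
    """Yield (label, image_dict) for every bitmap image reachable from a
    resources dictionary, descending into form XObjects."""
    _seen = set() if _seen is None else _seen
    resources = resolve(resources)
    if resources is None or "/XObject" not in resources:
        return
    xobjects = resolve(resources["/XObject"])
    for name in xobjects:
        obj = resolve(xobjects[name])
        if id(obj) in _seen:
            # Already visited: skips shared objects and self-referencing forms.
            continue
        _seen.add(id(obj))
        label = prefix + name[1:]
        subtype = obj.get("/Subtype")
        if subtype == "/Image":
            yield label, obj
        elif subtype == "/Form" and "/Resources" in obj:
            # A form XObject carries its own resources, which may hold images.
            yield from iter_images(obj["/Resources"], label + "_", _seen)


# Plain-dict stand-in for a page's /Resources entry.
page_resources = {
    "/XObject": {
        "/Im0": {"/Subtype": "/Image"},
        "/Fm0": {
            "/Subtype": "/Form",
            "/Resources": {"/XObject": {"/Im1": {"/Subtype": "/Image"}}},
        },
    }
}
print([label for label, _ in iter_images(page_resources)])  # ['Im0', 'Fm0_Im1']
```

With the question's code, the call would presumably look like `iter_images(tempPage['/Resources'])`, with each yielded image dictionary handled by the existing `/Filter` branches.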

1 Answer


Tested on Ubuntu 16.04 (amd64): pdffigures builds and runs with no errors here.

sudo apt install libpoppler-dev libleptonica-dev

git clone https://github.com/allenai/pdffigures.git
cd pdffigures/
make              # The executable 'pdffigures' gets created.
./pdffigures
Knud Larsen