
I receive a PDF from another department with a huge number of pages (around 1,500). The PDF is a compilation of subdistrict documents for a district. To verify this data, I want to extract it from the PDF. On my first try I used PDFMiner to extract the text, but that approach took too much time.

My strategy is to reduce the number of pages by extracting only the first page of every subdistrict document in the PDF, which is its summary page. I can identify these pages because each one contains an image. To extract them, I need the indices of the pages that have pictures (the variable `page_images`). So I wrote my code like this:

import fitz

pdf = fitz.Document("sample2.pdf")
page_images = []

for i in range(len(pdf)):
    images = pdf.get_page_images(i)  # images referenced by page i
    if images:
        page_images.append(i)

print(f"page_has_image: {page_images}")
print(f"total_page_has_image: {len(page_images)}")
print(f"total_pages: {len(pdf)}")

which produces:

page_has_image: [0, 8, 17]
total_page_has_image: 3
total_pages: 20

This code works as I expect with sample.pdf, the first 20 pages, which I made using Print to PDF. But when I switch to the real PDF, the result breaks to:

page_has_image: [0, 1, 2, 3, ..., 1532, 1533, 1534]
total_page_has_image: 1535
total_pages: 1535

I know I'm new to data mining, so feel free to suggest another approach. Thanks.

This is a sample PDF produced by splitting with PyPDF2's PdfWriter: link-sample

I expect the variable page_has_image to contain the list of pages with images, corresponding to the real PDF.

  • The list returned by `get_images()` does not necessarily match the images **_shown_** by the page - it is rather the set of images referenced by the page definition: the PDF creator may have decided to list all images of the complete PDF in the definition of each page - and similar variations. Depending on what you really want, you can **either** force **synchronization** of the images in the page definition with the actually displayed ones, **_or_** walk through the PDF objects (not the pages) and count each object that represents an image. Continued ... – Jorj McKie Apr 18 '23 at 14:24
  • **Synchronization** means to execute `page.clean_contents()` before doing `page.get_images()`. – Jorj McKie Apr 18 '23 at 14:26
  • Checked your file: no page is directly displaying any image. Instead, all pages invoke the same Form XObject (xref 8), which displays two images at xrefs 29 and 32. So to see what a page actually displays, you should use the code in the answer. – Jorj McKie Apr 18 '23 at 14:47
  • @JorjMcKie My goal is to extract the data on those pages (the result of this task); I think it must be much faster if I reduce the number of pages in the PDF, right? – Alfian Khusnul Apr 18 '23 at 15:26

1 Answer


This is a more complicated case, as explained in the comments. The following method will, however, detect when a page actually invokes the XObject containing the images:

for page in doc:
    img_refs = page.get_image_info(xrefs=True)
    if img_refs:
        print("Page", page.number, "images:", [i["xref"] for i in img_refs])

Output:

Page 0 images: [29, 32]
Page 8 images: [29, 32]
Page 17 images: [29, 32]
Jorj McKie