I receive a PDF from another department with a huge number of pages (around 1,500). The PDF is a compilation of subdistrict documents within one district. To verify this data, I want to extract data from the PDF. On my first try I used PDFMiner to extract all of the text, but that approach took far too long.
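That first attempt looked roughly like the sketch below (a minimal sketch, not my exact script; it uses pdfminer.six's high-level extract_text helper, and the file name is a placeholder):

from pdfminer.high_level import extract_text

# "district.pdf" is a placeholder name for the full ~1,500-page PDF.
# Extracting the text of the whole document in one call takes a very long time.
text = extract_text("district.pdf")
print(len(text))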
My strategy is to reduce the number of pages by extracting only the first page of each subdistrict document in the PDF, which is its summary page. I can identify these pages because each of them contains an image. To extract them, I need the indexes of the pages that contain images (the variable page_images). So I wrote my code like this:
import fitz  # PyMuPDF

pdf = fitz.Document("sample2.pdf")
page_images = []
for i in range(len(pdf)):
    # list of images referenced by page i
    images = pdf.get_page_images(i)
    hasimage = len(images) > 0
    # print(f"{i}: {hasimage}: {images}")
    if hasimage:
        page_images.append(i)

print(f"page_has_image: {page_images}")
print(f"total_page_has_image: {len(page_images)}")
print(f"total_pages: {len(pdf)}")
With a result like this:
page_has_image: [0, 8, 17]
total_page_has_image: 3
total_pages: 20
This code works as I expect with sample.pdf, which contains the first 20 pages and was made using Print to PDF. But when I switch to the real PDF, the result breaks down to:
page_has_image: [0, 1, 2, 3, ..., 1532, 1533, 1534]
total_page_has_image: 1535
total_pages: 1535
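One idea I have for narrowing this down (only a rough sketch, I have not verified it on the real file) is to check what is actually drawn on each page with Page.get_image_info(), since get_page_images() lists images referenced by the page's resources, and those resources may end up shared across all pages after a split:

import fitz  # PyMuPDF

pdf = fitz.Document("sample2.pdf")
pages_with_drawn_images = []
for page in pdf:
    # get_image_info() describes images actually displayed on the page,
    # not merely referenced in its resource dictionary
    if page.get_image_info():
        pages_with_drawn_images.append(page.number)

print(f"pages_with_drawn_images: {pages_with_drawn_images}")

If that still flags every page, filtering on the width/height fields of the returned dictionaries might separate the large summary-page picture from small decorative images.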
I know I am new to data mining, so feel free to suggest another approach. Thanks.
Here is a sample PDF created by splitting the real file with PyPDF2's PdfWriter (roughly as in the snippet at the end of this post): link-sample
I expect the variable page_images (printed as page_has_image) to contain only the pages that actually have an image, matching the real PDF.
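For completeness, the split was done roughly like this (a minimal sketch, not the exact script; file names and the page count are placeholders, and the class names are from the newer PyPDF2 API):

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("district.pdf")  # placeholder name for the full PDF
writer = PdfWriter()

# copy the first 20 pages into a smaller sample file
for i in range(20):
    writer.add_page(reader.pages[i])

with open("sample-split.pdf", "wb") as f:
    writer.write(f)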