Extract First Page from Multiple PDF Documents in Python

Question

I have a document library consisting of several thousand PDF documents. I am trying to extract the first page from each document. Each extracted page should then be stored as its own file in a folder called "First Page".

I have written the script below to print and export the first page of each document. It has exported files for some of the documents in my library, but not for the vast majority. Looking at the terminal, I see a lot of errors with the message "Superfluous whitespace found in object header b'21' b'0'". I have searched online but have not found anything relevant.

I have three questions:

Would anyone have any idea how I can address the Superfluous whitespace issue?
My documents seem to be exporting as unreadable or damaged files. Is there something missing from my code?
My documents are also not exporting to my required output directory. I am unsure how I point the extracts to this directory. Would anyone be able to help with this also?

import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# get the file names in the directory
input_directory = 'Fund Docs'
entries = os.listdir(input_directory)
output_directory = 'First Pages'
outputs = os.listdir(output_directory)


for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(input_directory + '/' + entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()

    outputFileName = 'First_Page' + entry + '.pdf'
    with open(outputFileName, 'wb') as out:
        pdf_writer = PyPDF2.PdfFileWriter(out)

        print('created ', outputFileName)

Martin Thoma · Answer 1 · 2023-01-10T12:08:18.887

Several points:

Use pypdf (PyPDF2 is deprecated)
Use PdfReader (PdfFileReader is deprecated); PdfReader has strict=False by default
The "Superfluous whitespace found" message is only a warning when strict=False. You see it because the PDF is not fully standards-compliant. You can silence the warning: https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html
When you write a question, you should also mention which version of the critical libraries you're using.
