I have a document library consisting of several thousand PDF documents. I am trying to extract the first page from each document and store each extracted page as an individual file in a folder called "First Pages".
I have written the script below to export the first page of each document. It has worked for some of the documents in my library, but the vast majority have not been exported. Examining the terminal output, I see many errors with the message "Superfluous whitespace found in object header b'21' b'0'". I have searched online but cannot find anything relevant.
I have three questions:
1. Does anyone have any idea how I can address the "Superfluous whitespace" issue?
2. The files that are exported seem to be unreadable or damaged. Is there something missing from my code?
3. The files are also not being written to my required output directory, and I am unsure how to point the exports at it. Could anyone help with this as well?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# get the file names in the directory
input_directory = 'Fund Docs'
entries = os.listdir(input_directory)
output_directory = 'First Pages'
outputs = os.listdir(output_directory)

for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(input_directory + '/' + entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()
    outputFileName = 'First_Page' + entry + '.pdf'
    with open(outputFileName, 'wb') as out:
        pdf_writer = PyPDF2.PdfFileWriter(out)
    print('created ', outputFileName)
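For reference, here is a minimal sketch of how the three points might be addressed. It is not a definitive fix, and it assumes the pre-3.0 PyPDF2 API used above (`PdfFileReader`, `PdfFileWriter`, `getPage`, `addPage`). The helper names `first_page_path`, `export_first_page` and `export_all` are hypothetical. As far as I can tell, the "Superfluous whitespace" message is a warning that PyPDF2 only emits in strict mode, so the sketch passes `strict=False`. The script above never adds a page to the writer or calls `write()`, which would leave empty output files, so the sketch does both while the source file is still open. The output path is built with `os.path.join` so the files land in the output directory.

```python
import os


def first_page_path(output_directory, entry):
    """Build the output path, e.g. 'First Pages/First_Page_report.pdf'."""
    # Drop the existing extension so the name does not end in '.pdf.pdf'.
    stem, _extension = os.path.splitext(entry)
    return os.path.join(output_directory, 'First_Page_' + stem + '.pdf')


def export_first_page(source_path, target_path):
    """Copy page 0 of source_path into a new one-page PDF at target_path."""
    # Imported here so the path helper above can be used without PyPDF2.
    # Assumption: a PyPDF2 release older than 3.0, as in the question.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # The source stays open until write() returns, because the writer
    # still reads page data from the source stream at that point.
    with open(source_path, 'rb') as source:
        reader = PdfFileReader(source, strict=False)  # tolerate sloppy headers
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(0))
        with open(target_path, 'wb') as target:
            writer.write(target)


def export_all(input_directory='Fund Docs', output_directory='First Pages'):
    """Export the first page of every PDF found in input_directory."""
    os.makedirs(output_directory, exist_ok=True)
    for entry in sorted(os.listdir(input_directory)):
        if not entry.lower().endswith('.pdf'):
            continue  # ignore anything that is not a PDF
        target_path = first_page_path(output_directory, entry)
        try:
            export_first_page(os.path.join(input_directory, entry), target_path)
        except Exception as error:  # one bad file should not stop the batch
            print('skipped', entry, error)
        else:
            print('created', target_path)
```

Calling `export_all()` from the directory that contains "Fund Docs" should write one file per source PDF into "First Pages". A file that fails partway through may leave an empty output file behind, so the 'skipped' lines are worth checking afterwards.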