How do I extract the text of a single page with PyPDF2?

Question

I have a document library which consists of several hundred PDF documents. I am attempting to export the first page of each PDF document. Below is my script which extracts the page. It saves each page as an individual PDF. However, the exported files appear to be unreadable or damaged.

Is there something missing from my script?

import os
from PyPDF2 import PdfReader, PdfWriter

# get the file names in the directory
input_directory = "Fund_Docs_Sample"
entries = os.listdir(input_directory)
output_directory = "First Pages"
outputs = os.listdir(output_directory)

for output_file_name in entries:
    reader = PdfReader(input_directory + "/" + output_file_name)
    page = reader.pages[0]
    first_page = "\n" + page.extract_text() + "\n"

    with open(output_file_name, "wb") as outputStream:
        pdf_writer = PdfWriter(output_file_name + first_page)

In my case it's EOF marker not found. `Traceback (most recent call last): ... File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1465, in _find_eof_marker raise PdfReadError("EOF marker not found") PyPDF2.errors.PdfReadError: EOF marker not found` ***In MY case only.*** Try recognize your original PDF file? — Misinahaiya, Dec 21 '22 at 11:43
When I read in one file from the document library, the same occurs. It exports the first page as a PDF but the PDF is unreadable or damaged. — Data_Science_Mick, Dec 21 '22 at 11:58
You were not actually exporting a PDF, but text. Or the file was empty. There were too many issues to pin-point the exact problem of your code. The example in my answer should help — Martin Thoma, Dec 23 '22 at 11:21
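
A note on the "EOF marker not found" error from the first comment: a PDF file normally ends with a `%%EOF` line, and the error means PyPDF2 could not find one. Below is a minimal sketch of a pre-check that could be run over the library before parsing. The names `has_eof_marker` and `tail_size` are hypothetical, and this is only a rough heuristic, not PyPDF2's own logic: a file that passes may still be invalid in other ways.

```python
import os


def has_eof_marker(pdf_path, tail_size=1024):
    """Return True if b"%%EOF" appears in the last `tail_size` bytes of the file.

    Hypothetical helper: a quick sanity check, not a full PDF validation.
    """
    with open(pdf_path, "rb") as f:
        # Find the file size, then jump to the start of the tail.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        return b"%%EOF" in f.read()
```

A zero-byte file, or a text file saved with a `.pdf` extension, fails this check, so it may help to separate truncated or mislabelled inputs from a bug in the extraction script.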

trolloldem · Answer 1 · 2022-12-22T20:26:01.543

I think the cause could be that your code is missing a call to add_page(page) (addPage in older PyPDF2 releases), which specifies the contents of the first page of the output file. The code that you need is similar to the one proposed in the answer to this question. In particular, a possible solution could look like this:

import os
from PyPDF2 import PdfReader, PdfWriter

# get the file names in the directory
input_directory = 'Fund_Docs_Sample'
entries = os.listdir(input_directory)
output_directory = 'First Pages'

for entry in entries:
    # create a PDF reader object
    with open(f"{input_directory}/{entry}", 'rb') as infile:
        reader = PdfReader(infile)
        writer = PdfWriter()
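        # add_page() and reader.pages replace the older addPage()/getPage(),
        # which raise an error in PyPDF2 3.0 and later
        writer.add_page(reader.pages[0])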

        with open(f"{output_directory}/{entry}", 'wb') as outfile:
            writer.write(outfile)

With this code, each output PDF contains only the first page and has the same name as the corresponding original PDF, but it is located in output_directory.
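
One thing to watch with os.listdir: it returns every entry in the directory, including subdirectories and non-PDF files, and passing those to PdfReader will fail. A small sketch of a filter, assuming that only regular files ending in ".pdf" should be processed (`list_pdf_names` is a hypothetical helper name):

```python
import os


def list_pdf_names(directory):
    """Return the sorted names of regular files in `directory` that end in .pdf.

    Hypothetical helper: the extension match is case-insensitive, and
    subdirectories are skipped even if their name ends in ".pdf".
    """
    names = []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and name.lower().endswith(".pdf"):
            names.append(name)
    return sorted(names)
```

The loop could then iterate over `list_pdf_names(input_directory)` instead of the raw directory listing.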

Martin Thoma · Accepted Answer · 2022-12-23T11:05:12.223

You're missing pdf_writer.write(outputStream).
Do you want to write a text file (containing the extracted text) or a PDF file (containing the first page of the PDF)?
You seem to be overwriting the input files.
output_directory is not used at all.

After reading the comments, you likely want this:

from pathlib import Path
from PyPDF2 import PdfReader

# get the file names in the directory
input_directory = Path("Fund_Docs_Sample")
output_directory = Path("First Pages")

for input_file_path in input_directory.glob("*.pdf"):
    print(input_file_path)
    reader = PdfReader(input_file_path)
    page = reader.pages[0]
    first_page_text = "\n" + page.extract_text() + "\n"
    
    # create the output text file path
    output_file_path = output_directory / f"{input_file_path.name}.txt"
    
    # write the text to the output file
    with open(output_file_path, "w") as output_file:
        output_file.write(first_page_text)
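
Note that the loop above assumes the "First Pages" directory already exists; otherwise open(..., "w") raises FileNotFoundError. A minimal sketch of a helper that creates the directory on demand and writes the text with an explicit encoding is shown here. The name `write_page_text` and the choice of UTF-8 are assumptions, not part of the original answer.

```python
from pathlib import Path


def write_page_text(output_directory, source_name, text):
    """Write `text` to <output_directory>/<source_name>.txt and return the path.

    Hypothetical helper: creates the output directory if it is missing and
    uses UTF-8 so the result does not depend on the platform default encoding.
    """
    out_dir = Path(output_directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{source_name}.txt"
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Inside the loop this would be called as `write_page_text(output_directory, input_file_path.name, first_page_text)`.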

edited Dec 23 '22 at 11:05

answered Dec 21 '22 at 23:58

Martin Thoma


Thanks Martin. I was hoping to extract the first page as a PDF and then later extract the text from each page. However, extracting the text from each first page into a text file would work. – Data_Science_Mick Dec 22 '22 at 13:14
You don't need to extract the first page as a PDF first. You can just get the text directly: `text = reader.pages[0].extract_text()`. Then you don't need PdfWriter. Just write `text` to a text file. – Martin Thoma Dec 22 '22 at 20:08
@Data_Science_Mick I've added example code that is likely what you want. If you want to find files recursively, adjust the glob pattern to `"**/*.pdf"` – Martin Thoma Dec 23 '22 at 11:07
Thank you very much Martin!! I really appreciate your help with this. That has worked perfectly. I also got this to work via updated code. However your code is tidier and more correct. I really appreciate your help. – Data_Science_Mick Dec 23 '22 at 12:57
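
To illustrate the recursive glob pattern mentioned in the comments, here is a minimal sketch that switches between the flat pattern and `"**/*.pdf"`. The helper name `find_pdfs` and its `recursive` flag are hypothetical; only `Path.glob` itself comes from the standard library.

```python
from pathlib import Path


def find_pdfs(input_directory, recursive=False):
    """Return a sorted list of PDF file paths under `input_directory`.

    Hypothetical helper: with recursive=True the "**/*.pdf" pattern also
    matches files in nested subdirectories; otherwise only the top level.
    """
    pattern = "**/*.pdf" if recursive else "*.pdf"
    root = Path(input_directory)
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

With the recursive pattern, files with the same name in different subdirectories would map to the same output name if only `path.name` is used, so the output naming may need adjusting in that case.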
