10

I am trying to get text out of a pdf file. Below is the code:

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)

page = pdf.getPage(1)
#print(dir(page))
print(page.extractText())

This gives me the error

ValueError: seek of closed file

When I move the code inside the with block, it works fine. My question is: why is this so? I have already stored the information in the 'pdf' object, so I should be able to access it outside the block.

Mad Physicist
Jeet Singh

2 Answers

14

PdfFileReader expects a seekable, open stream. It does not load the entire file into memory, so you have to keep the file open to run methods like getPage. Your hypothesis that creating a reader automatically reads in the whole file is incorrect.

A with statement operates on a context manager, such as a file. When the with block ends, the context manager's __exit__ method is called. In this case, it closes the file handle that your PdfFileReader is trying to use to get the second page.
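A minimal sketch of that behaviour, using only the standard library and a throwaway temporary file rather than a real PDF (the file and variable names here are illustrative, not from the question). It shows that the name `file` still exists after the block, but the handle behind it is closed, so seeking fails the same way the reader does:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"example bytes")

with open(path, "rb") as file:
    inside_closed = file.closed  # still open inside the block

# The with block has ended, so __exit__ has closed the handle.
after_closed = file.closed

try:
    file.seek(0)  # any object still holding this handle hits the same error
    error_message = None
except ValueError as exc:
    error_message = str(exc)

os.remove(path)
print(inside_closed, after_closed, error_message)
```

The variable outliving the block is what makes this confusing: the object is still reachable, it just no longer has an open file underneath it.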

As you found out, the correct procedure is to read what you need from the PDF before you close the file. If, and only if, your program needs the PDF open until the very end, you can pass the file name directly to PdfFileReader. There is no (documented) way to close the file after that, though, so I would recommend the approach you arrived at:

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)
    page = pdf.getPage(1)
    print(page.extractText())
# file is closed here, pdf will no longer do its job
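If the data really does need to outlive the with block, one possible alternative is to copy the file's bytes into an in-memory buffer first. The sketch below uses only the standard library; `load_into_memory` is a hypothetical helper, and the temporary file holds stand-in bytes rather than a real PDF:

```python
import io
import os
import tempfile

def load_into_memory(path):
    """Return a seekable in-memory copy of the file (hypothetical helper)."""
    with open(path, "rb") as file:
        return io.BytesIO(file.read())

# Stand-in file so the example runs anywhere; not a valid PDF.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"%PDF-1.4 stand-in bytes")

buffer = load_into_memory(path)
os.remove(path)  # the file on disk is closed and even deleted

# The buffer is independent of the file, so it is still open and seekable.
buffer.seek(0)
first_bytes = buffer.read(8)
print(first_bytes)
```

Assuming PdfFileReader accepts any seekable binary stream, as described above, passing such a buffer in place of the file object should let getPage keep working after the file is closed. The cost is holding the whole PDF in memory, which may not suit large files.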
Mad Physicist
  • 1
    I think you are right. PdfFileReader doesn't store anything; it needs an open file until its task is complete. We can, however, extract data inside the block and use it later, since the extracted text is stored in memory: `with open('test_2.pdf', 'rb') as file: pdf = PdfFileReader(file); page = pdf.getPage(2); data = page.extractText(); print(data[:40])` – Jeet Singh May 06 '19 at 04:43
0

I had the same error. Try indenting the last lines so they sit inside the with block. This worked for me after 2 days of searching.

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)

    page = pdf.getPage(1)
    #print(dir(page))
    print(page.extractText())
Masutier