ValueError: seek of closed file Working on PyPDF2 and getting this error

Question

I am trying to get text out of a pdf file. Below is the code:

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)

page = pdf.getPage(1)
#print(dir(page))
print(page.extractText())

This gives me the error

ValueError: seek of closed file

When I move the code under the with statement, it works fine. My question is: why is this so? I have already stored the information in the 'pdf' object, so I should be able to access it outside the block.

Please show a traceback and the actual code that shows the second error — Mad Physicist, May 05 '19 at 11:26
Thanks. Looks like your pdf package doesn't support Python 3 — Mad Physicist, May 05 '19 at 11:33
But i just ran same code by changing pdf file. It gives some results with error: ```PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]``` — Jeet Singh, May 05 '19 at 11:37
That's a reasonable warning. The error in the question looks like what you get when you expect a python 2 bytes, which is str, but run in python 3, where bytes are integers. Not sure how you got that. — Mad Physicist, May 05 '19 at 11:39
Are you running in different virtual environments? Have different versions of Python or PyPDF installed? What versions do you have? — Mad Physicist, May 05 '19 at 11:40
only single version of each ```$ py --version Python 3.7.2 ``` and ```$ pip freeze | grep PyPDF2 PyPDF2==1.26.0 ``` — Jeet Singh, May 05 '19 at 11:45
Can you attach a sample of the pdf or a link to it here please? — Mad Physicist, May 05 '19 at 12:16
i tried with another pdf and results are as expected so i guess i should change the module/ method. Can't rely on it. https://www.w3.org/Protocols/HTTP/1.1/rfc2616.pdf this is the link btw — Jeet Singh, May 05 '19 at 12:21
Could you split your question into two please? I would be happy to post an answer to part 1 here, but could you make part 2, with the link above into a separate question? — Mad Physicist, May 05 '19 at 12:47
Where is part 2? I'll take a look at it later today if I get a chance. — Mad Physicist, May 05 '19 at 20:02
@MadPhysicist here's part2 https://stackoverflow.com/questions/55993860/getting-typeerror-ord-expected-string-of-length-1-but-int-found-error — Jeet Singh, May 06 '19 at 04:52

score 14 · Accepted Answer · answered May 05 '19 at 19:59

PdfFileReader expects a seekable, open stream. It does not load the entire file into memory, so you have to keep the file open to call methods like getPage. Your hypothesis that creating a reader automatically reads in the whole file is incorrect.

A with statement operates on a context manager, such as a file. When the with block ends, the context manager's __exit__ method is called. In this case, it closes the file handle that your PdfFileReader is trying to use to get the second page.
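The closing behavior described above can be reproduced without PyPDF2. The following is a minimal sketch using only the standard library; the temporary file is a stand-in for the PDF, and the explicit seek call is an assumption about the kind of call a lazy reader makes later.

```python
import os
import tempfile

# Create a small throwaway file to stand in for the PDF (illustration only)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'stand-in bytes, not a real PDF')
    path = tmp.name

with open(path, 'rb') as file:
    inside = file.closed   # False: the handle is still open inside the block

after = file.closed        # True: __exit__ closed the handle when the block ended

try:
    file.seek(0)           # a reader that works lazily would seek like this later
    error = None
except ValueError as exc:
    error = str(exc)       # seeking on a closed file raises ValueError

os.remove(path)
print(inside, after, error)
```

The variable `file` still exists after the block, which is why the mistake is easy to make: the name is in scope, but the underlying handle is no longer usable.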

As you found out, the correct procedure is to read what you need from the PDF before you close the file. If, and only if, your program needs the PDF open until the very end, you can pass the file name directly to PdfFileReader. There is no (documented) way to close the file after that, though, so I would recommend the approach you arrived at:

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)
    page = pdf.getPage(1)
    print(page.extractText())
# file is closed here, pdf will no longer do its job
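If the reader has to outlive the with block, one possible alternative (not part of the answer above) is to copy the file's bytes into an in-memory stream, which stays open and seekable after the file on disk is closed. This is a rough sketch under that assumption: `read_into_memory` is a hypothetical helper, the temporary file stands in for a real PDF, and the whole file is held in memory, which may not suit large documents.

```python
import io
import os
import tempfile


def read_into_memory(path):
    """Return a seekable in-memory copy of the file at `path` (hypothetical helper)."""
    with open(path, 'rb') as file:
        return io.BytesIO(file.read())


# Stand-in file so the sketch runs without a real PDF
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'stand-in bytes, not a real PDF')
    path = tmp.name

buffer = read_into_memory(path)   # the file on disk is already closed here
os.remove(path)

buffer.seek(0)                    # still works: the buffer holds its own copy
first_bytes = buffer.read(8)

# With PyPDF2 installed, the same kind of buffer could be handed to the reader:
# pdf = PdfFileReader(read_into_memory('HTTP_Book.pdf'))
```

The trade-off is memory for convenience: the reader can be used anywhere in the program, but the file contents are loaded up front, which is exactly what PdfFileReader avoids by default.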

I think you are right. PdfFileReader doesn't store anything; it needs an open file until its task is complete. We can, however, extract the data and use it later, since that is stored in memory: ```with open('test_2.pdf','rb') as file: pdf=PdfFileReader(file) page=pdf.getPage(2) data=page.extractText() print(data[:40])``` — Jeet Singh, May 06 '19 at 04:43

score 0 · Answer 2 · answered Apr 28 '23 at 16:12

I had the same error. Try indenting the last lines into the with block. This worked for me after 2 days of searching.

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf', 'rb') as file:
    pdf = PdfFileReader(file)

    page = pdf.getPage(1)
    #print(dir(page))
    print(page.extractText())

Masutier

Technically correct, but what does this add beyond the accepted answer? – Mad Physicist Jun 15 '23 at 15:32
