I'm opening a lot of PDFs and I want to delete them after they have been parsed, but the files remain open until the program is done running. How do I close the PDFs I open using PyPDF2?

Code:

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = PyPDF2.PdfFileReader(file(path, "rb"))

    #Check for number of pages, prevents out of bounds errors
    max = 0
    if pdf.numPages > 3:
        max = 3
    else:
        max = (pdf.numPages - 1)

    # Iterate pages
    for i in range(0, max): 
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    #pdf.close()
    return content
Håken Lid
SPYBUG96

3 Answers

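Open the file yourself, and close it once you have finished reading from the PDF:

f = open(path, "rb")
pdf = PyPDF2.PdfFileReader(f)
# ... extract what you need from pdf here ...
f.close()

PyPDF2 parses the file's structure from the stream you pass in when the reader is constructed, but it keeps a reference to that stream and may seek in it again when you later request pages. Closing the file too early can therefore fail with an error such as "seek of closed file", so keep it open until extraction is finished. After that, you can just toss the file.

A context manager will work, too:

with open(path, "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
    do_other_stuff_with_pdf(pdf)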
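
If the goal is to release the file on disk as early as possible, one option is to copy its bytes into memory and hand the reader an in-memory stream instead. The sketch below uses only the standard library; `read_into_memory` is a hypothetical helper name, and passing its result to `PyPDF2.PdfFileReader` assumes the reader accepts any seekable binary stream.

```python
import io


def read_into_memory(path):
    """Return the file's bytes wrapped in a seekable in-memory stream."""
    with open(path, "rb") as f:
        data = f.read()
    # The file on disk is closed here; the buffer is independent of it.
    return io.BytesIO(data)
```

Under that assumption, `pdf = PyPDF2.PdfFileReader(read_into_memory(path))` would leave no open handle on `path`, at the cost of holding the whole file in memory.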
Him

When doing this:

pdf = PyPDF2.PdfFileReader(file(path, "rb"))

you're passing in a file handle that you hold no reference to, so you have no control over when the file is closed.

Instead of passing the handle anonymously, open it in a with block so that it is closed at a point you choose. I would write:

with open(path, "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
    # Check for number of pages, prevents out of bounds errors
    # ... do your processing, building up `content` ...
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
# now the file is closed by exiting the block, you can delete it (requires `import os`)
os.remove(path)
# and return the contents
return content
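
The parse-then-delete pattern above can be wrapped in a small reusable function. This is a minimal sketch rather than anything PyPDF2 provides: `parse_then_delete` and its `parse` argument are hypothetical names, with `parse` standing in for whatever extraction you run on the open file.

```python
import os


def parse_then_delete(path, parse):
    """Run parse(f) on the open file, then delete the file if parsing succeeded.

    `parse` is a hypothetical callable that takes an open binary file
    and returns the extracted content.
    """
    with open(path, "rb") as f:
        content = parse(f)
    # Reached only if parse() did not raise. The handle is already closed,
    # so the delete also works on platforms that lock open files.
    os.remove(path)
    return content
```

Because `os.remove` sits after the block, a file whose parsing raises an exception is closed but left on disk, which makes failures easier to inspect.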
Jean-François Fabre

Yes, you are passing the stream in to PdfFileReader, so you can close it yourself. A with statement is the preferable way to have that done for you:

import PyPDF2

def getPDFContent(path):
    with open(path, "rb") as f:
        content = ""
        # Load PDF into PyPDF2
        pdf = PyPDF2.PdfFileReader(f)

        # Read at most the first 3 pages; min() keeps the index in bounds
        page_count = min(3, pdf.numPages)

        # Iterate pages
        for i in range(page_count):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
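
The function above returns from inside the with block, and the file is still closed in that case. A small standard-library sketch to check this for yourself follows; `first_bytes` and the `seen` list are hypothetical names used only to expose the handle for inspection.

```python
def first_bytes(path, seen):
    """Return the first 4 bytes of a file, returning from inside the with block.

    `seen` is a list used only to hand the file object back to the caller.
    """
    with open(path, "rb") as f:
        seen.append(f)
        # The context manager's exit still runs on return and closes f.
        return f.read(4)
```

After calling `first_bytes(path, seen)`, `seen[0].closed` is `True`, so the file can be deleted straight away.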
de1