Python code to extract txt from PDF document

Question

I have been trying to convert some PDFs to .txt, but most of the sample code I found online has the same issue: it only converts one page at a time. I am fairly new to Python, and I can't figure out how to replace the single .getPage() call so that the entire document is converted at once. All help is welcome.

import PyPDF2

pdfFileObject = open(r"F:\pdf.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

print(" No. Of Pages :", pdfReader.numPages)

pageObject = pdfReader.getPage(0)
print(pageObject.extractText())

pdfFileObject.close()

score 2 · Accepted Answer · answered Jan 14 '22 at 22:44


You could do this with a for loop: extract the text from each page inside the loop and append it to a list.

import PyPDF2

pages_text=[]
with open(r"F:\pdf.pdf", 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

    print(" No. Of Pages :", pdfReader.numPages)
    for page in range(pdfReader.numPages):
        pageObject = pdfReader.getPage(page)
        pages_text.append(pageObject.extractText())

print(pages_text)
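For readers on a newer release: the code above uses the PyPDF2 1.x names (PdfFileReader, numPages, getPage, extractText). As far as I know, those were removed in PyPDF2 3.0 and in its successor pypdf, which use PdfReader, reader.pages and page.extract_text() instead. A minimal sketch of the same loop under that assumption follows; the helper names extract_pages_text and read_pdf_text are made up for illustration, and it assumes the pypdf package is installed.

```python
def extract_pages_text(pages):
    """Return a list holding the extracted text of every page, in order."""
    # Fall back to an empty string in case a page yields no text at all
    # (for example a scanned page without a text layer).
    return [page.extract_text() or "" for page in pages]


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all pages as one string."""
    # Assumption: pypdf is installed (pip install pypdf). The import is kept
    # local so the helper above can be used without the package.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # reader.pages behaves like a sequence of page objects, so there is no
    # need to index pages by number.
    return "\n".join(extract_pages_text(reader.pages))
```

Usage would look like print(read_pdf_text(r"F:\pdf.pdf")). How clean the text comes out still depends on the PDF itself.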


user92234



Thanks a bunch, mate! This worked. I will add here that if someone wants to store it as a .txt file they just need to add:

lines = pages_text
with open('pdf.txt', 'w') as f:
    for line in lines:
        f.write(line)
        f.write('\n')

– Francisco Mello Castro Jan 15 '22 at 02:09
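The snippet in the comment above can be wrapped in a small function. This is only a sketch: the name save_pages_as_txt is hypothetical, and the explicit UTF-8 encoding is my addition, on the assumption that the extracted text may contain characters the platform's default encoding cannot write.

```python
def save_pages_as_txt(pages_text, out_path):
    """Write each page's text to `out_path`, one page after another."""
    # encoding="utf-8" is an assumption added here; the original comment
    # relied on the platform default.
    with open(out_path, "w", encoding="utf-8") as f:
        for text in pages_text:
            f.write(text)
            f.write("\n")  # separate consecutive pages with a newline
```

With the list built in the answer, the call would be save_pages_as_txt(pages_text, 'pdf.txt').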
