I found the following code, which extracts text from a PDF file. However, it only works for PDFs where the text can be selected and copied directly. Is there a way in Python to extract text from a PDF where the text can't be selected, such as a photocopy or scanned document saved as a PDF?
Here is the code I use to extract text from a non-photocopied PDF file:
import PyPDF2
pdfFileObject = open('C:\\filepath\\file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
output = ''
for i in range(count):
    page = pdfReader.getPage(i)
    output += page.extractText()
    # output.append(page.extractText())
output
This works like a charm for PDFs with selectable text, but it doesn't work on a photocopied document saved as a PDF. Is there a way to extract the text from that kind of file?
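To narrow down which pages are affected, here is a minimal sketch of a check I could run on the per-page results. It assumes (I have not confirmed this for every file) that a scanned page comes back from `extractText()` as an empty or whitespace-only string. The helper name `pages_without_text` and the sample list are made up for illustration.

```python
def pages_without_text(page_texts):
    """Return the indices of pages whose extracted text is empty or whitespace-only."""
    # Assumption: a page with no selectable text yields "" (or None) when extracted.
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


# Hypothetical sample standing in for the per-page extractText() results
sample = ["First page text", "", "  \n", "Last page text"]
print(pages_without_text(sample))  # prints [1, 2]
```

With the code above, the list could be built by collecting each `page.extractText()` result inside the loop instead of concatenating it into `output`.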