I found the following code, which extracts text from a PDF file. However, it only works for PDFs whose text can be selected and copied directly. Is there a way in Python to extract text from a document where the text can't be selected, such as a photocopy or scanned document saved as a PDF?

Here is the code I use to read text from a non-photocopied PDF file:

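import PyPDF2

# Uses the PyPDF2 < 3.0 names; 3.0+ renames these to PdfReader, len(reader.pages), and extract_text()
# Open the PDF in binary mode; the with-block closes the file afterwards
with open('C:\\filepath\\file.pdf', 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages

    # Concatenate the text layer of every page
    output = ''
    for i in range(count):
        page = pdfReader.getPage(i)
        output += page.extractText()

output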

Works like a charm. However, it doesn't work on a photocopied document saved as a PDF, and I'm curious whether there is a way to extract text from that kind of file.

1 Answer

PyPDF2 is a PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs. However, it only reads text that is already stored in the file; a scanned page is stored as an image, so there is no text for it to retrieve. To get text from a scanned document, you need OCR (Optical Character Recognition), the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file, and OCR turns that image back into text. There are various OCR tools, and pytesseract is one of the most widely used. Read more about those OCR tools here.
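As a rough illustration of the OCR route, here is a minimal sketch. It assumes the pdf2image package (which needs Poppler installed) to render each page to an image and pytesseract (which needs the Tesseract binary installed) to recognise the text. Neither package appears in the question; the function names `ocr_pages` and `extract_text_from_scanned_pdf`, the 300 dpi setting, and the `eng` language code are illustrative choices, not requirements.

```python
def ocr_pages(images, recognize):
    """Apply an OCR callable to each page image and join the results."""
    return "\n".join(recognize(image) for image in images)


def extract_text_from_scanned_pdf(pdf_path, dpi=300, lang="eng"):
    """Render every page of a scanned PDF to an image, then OCR it.

    Assumes pdf2image (plus Poppler) and pytesseract (plus the Tesseract
    binary) are installed; both are imported lazily so ocr_pages stays
    usable without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # One PIL image per page; a higher dpi usually helps recognition
    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(
        images,
        lambda image: pytesseract.image_to_string(image, lang=lang),
    )
```

With both dependencies installed, `extract_text_from_scanned_pdf('C:\\filepath\\file.pdf')` should return the recognised text of all pages as one string. Accuracy depends heavily on scan quality, so expect to clean up the result.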