Unable To Convert PDF to Text format

Question

I am getting the error below while parsing a PDF file with PyPDF2. I am attaching the PDF along with the error.

The PDF to be parsed is attached; please click to view.

Can anyone help?

import PyPDF2


def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content

error:

PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

Your file is a scanned document. You should use OCR functionality to get the text out of this. — saeed, Apr 13 '19 at 19:29

saeed · Answer 1 · 2019-04-13T20:24:50.847

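There are open-source tools for this, such as Tesseract (an OCR engine) and OpenCV (a computer vision library commonly used to preprocess images before OCR).

If you want to use Tesseract, for example, there is a Python wrapper library called pytesseract (the Tesseract engine itself must be installed separately).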

Most OCR tools work on images, so you first have to convert your PDF into an image format such as PNG or JPG. After that, you can load the image and process it with pytesseract.
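The conversion step is not shown in this answer, so here is a minimal sketch of one way it might be done. It assumes the third-party pdf2image package (which in turn needs the Poppler utilities installed); that library choice is an assumption, not something named above. The function names png_path_for and pdf_first_page_to_png and the file name pdfName.pdf are hypothetical.

```python
from pathlib import Path


def png_path_for(pdf_path):
    """Return a PNG path next to the PDF, e.g. pdfName.pdf -> pdfName.png."""
    return Path(pdf_path).with_suffix(".png")


def pdf_first_page_to_png(pdf_path, dpi=300):
    """Render the first page of a PDF to a PNG file and return the PNG path.

    Assumption: pdf2image and Poppler are installed.
    """
    # Imported lazily so png_path_for works even without pdf2image installed.
    from pdf2image import convert_from_path

    # convert_from_path returns a list of PIL images, one per rendered page;
    # first_page/last_page limit rendering to page 1 only.
    pages = convert_from_path(str(pdf_path), dpi=dpi, first_page=1, last_page=1)

    out_path = png_path_for(pdf_path)
    pages[0].save(out_path, "PNG")
    return out_path


# Example call (hypothetical file name):
# pdf_first_page_to_png("pdfName.pdf")
```

Rendering at a higher dpi generally gives the OCR engine more detail to work with, at the cost of larger images; 300 here is just a starting point, not a tuned value.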

Here is some sample code showing how to use pytesseract. Suppose you have already converted your PDF to an image named pdfName.png:

from PIL import Image
import pytesseract


def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the text in it.
    text = pytesseract.image_to_string(Image.open(filename))
    return text


print(ocr_core('pdfName.png'))
