3

I am trying to print the text of a PDF using PyPDF2. Here is my code:

import PyPDF2

pdf_file = open('report.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)  # page indices are zero-based, so this is the second page
page_content = page.extractText()
print(page_content.encode('utf-8'))

As a result, I get an empty byte string along with a warning:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
b''

I have checked that this warning by itself does not affect the results, but in my case no text is extracted at all. Any suggestions? Thanks.

muazfaiz

2 Answers

0

Try updating to a newer version. We have improved PyPDF2's text extraction support a lot over the past few months.

It might also be that the PDF contains no text, only an image of text. PyPDF2 is not OCR software; Tesseract OCR is.
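
A minimal sketch of both checks, assuming PyPDF2 2.0 or later is installed (the same `PdfReader` API is also available in its successor, pypdf). The helper names `extract_page_texts` and `pages_without_text` are hypothetical and not part of the library; `report.pdf` is the file name from the question.

```python
def extract_page_texts(path):
    """Return the extracted text of every page using the newer PdfReader API."""
    # Imported inside the function so the helper below works without the library.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Fall back to "" so pages without a text layer are easy to detect.
        return [page.extract_text() or "" for page in reader.pages]


def pages_without_text(texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [number for number, text in enumerate(texts, start=1)
            if not text.strip()]


# Example usage (requires report.pdf to exist):
#   texts = extract_page_texts("report.pdf")
#   print(pages_without_text(texts))
```

If every page is reported as having no text even after upgrading, the file is probably a scan, and the pages would have to go through OCR instead.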

Martin Thoma
-1

Try changing your code like this:

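import PyPDF2

pdf_file = open('report.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
# getPage() is zero-indexed: 1 selects the second page
page_content = read_pdf.getPage(1).extractText()
# 'strict' is already the default error handler for str.encode()
print(page_content.encode('utf-8', 'strict'))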
James C. Taylor