PyPDF2 not printing any output from the text

Question

I am trying to print the text of a PDF using PyPDF2. Here is my code:

import PyPDF2

pdf_file = open('report.pdf', 'rb')  # PDFs must be opened in binary mode
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(1)  # page indices are zero-based, so this is the second page
page_content = page.extractText()
print(page_content.encode('utf-8'))

As a result, I get an empty line along with a warning:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
b''

I have checked that this warning by itself does not affect the results, but in my case I get no text at all. Any suggestions? Thanks.

score 0 · Answer 1 · answered Jul 30 '22 at 10:29

Try updating to a newer version. We have improved PyPDF2's text extraction a lot over the past few months.
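For illustration, here is a minimal sketch of the same read using the newer API names. It assumes PyPDF2 2.0 or later, where `PdfReader`, `reader.pages`, and `extract_text()` replace `PdfFileReader`, `getPage()`, and `extractText()`. The helper name `extract_page_text` is made up for this example.

```python
def extract_page_text(path, page_index=0):
    """Return the text of one page of a PDF (zero-based page index).

    Hypothetical helper; assumes PyPDF2 >= 2.0 is installed.
    """
    # Imported here so the dependency is only needed when the helper is called.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    page_count = len(reader.pages)
    if not 0 <= page_index < page_count:
        raise IndexError(f"page_index {page_index} out of range for {page_count} page(s)")

    # Fall back to an empty string if the library returns nothing for the page.
    return reader.pages[page_index].extract_text() or ""


# Example call (assumes a file named report.pdf exists next to the script):
# print(extract_page_text("report.pdf", 0))
```

Something like `pip install --upgrade PyPDF2` should bring in the newer release before trying this.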

It might also be that the page contains no text at all, only an image of text. PyPDF2 is not OCR software; Tesseract OCR is.
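One way to tell the two cases apart is to check whether the extracted string is empty or whitespace only, and to fall back to OCR if it is. The sketch below is only an illustration: `is_blank` and `ocr_page` are hypothetical helper names, and the OCR part assumes the `pdf2image` and `pytesseract` packages are installed together with the Poppler and Tesseract binaries they wrap.

```python
def is_blank(text):
    """Return True when extracted text is missing, empty, or only whitespace."""
    return not text or not text.strip()


def ocr_page(path, page_number=1):
    """OCR a single page of a PDF and return the recognized text.

    Hypothetical helper. page_number is 1-based, as pdf2image expects.
    Assumes pdf2image, pytesseract, Poppler, and Tesseract are available.
    """
    from pdf2image import convert_from_path
    import pytesseract

    # Render just the requested page to an image, then run Tesseract on it.
    images = convert_from_path(path, first_page=page_number, last_page=page_number)
    return "\n".join(pytesseract.image_to_string(image) for image in images)


# Example flow (page_text is whatever PyPDF2 returned for the page):
# if is_blank(page_text):
#     page_text = ocr_page("report.pdf", page_number=1)
```

A result made up entirely of newline characters counts as blank here, so a scanned page that yields only line breaks is treated the same as one that yields nothing.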

Martin Thoma


score -1 · Answer 2 · answered Jun 06 '17 at 17:14

Try changing your code like this:

import PyPDF2
pdf_file = open('report.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page_content = read_pdf.getPage(1).extractText()
print (page_content.encode('utf-8','strict'))

James C. Taylor


Nope. I still just get this `b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'` – DollarAkshay Aug 19 '18 at 13:17
