
I need to read this PDF.

I am using the following code:

from PyPDF2 import PdfFileReader

f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())

print(content)

However, the encoding is incorrect. It prints:

Resultado da Prova de Sele“‰o do...

But I expected

Resultado da Prova de Seleção do...

How can I solve this?

I'm using Python 3

macabeus

2 Answers


The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, explicitly encoding the Unicode into UTF-8:

# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))

You're on Python 3, so strings are Unicode under the hood and the default source encoding is UTF-8. Still, you may need to specify a different encoding depending on your locale.

# Show the locale aliases Python knows about
import locale
from pprint import pprint
pprint(locale.locale_alias)
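
To see which encoding Python will actually use when printing, something like the following may help. This is only a minimal sketch; the helper name `describe_encodings` is made up for illustration and is not part of PyPDF2.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports for stdout and the locale."""
    return {
        # Encoding used when print() writes to stdout; may be None if
        # stdout was replaced by an object without this attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding the locale module prefers for text data.
        "preferred": locale.getpreferredencoding(False),
    }


print(describe_encodings())
```

If both report UTF-8 and the text is still wrong, the terminal is probably not the cause.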

If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.

For example, a quick and dirty comparison of the good and bad strings you posted:

# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

print("\n" * 2)

print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

Relevant Output:

b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240

b'\xc3\xa7' 231
b'\xc3\xa3' 227

If you expected code point 231 (hex(231) gives '0xe7') but got 8220 instead, then you're getting bad data back from PyPDF.
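
A shorter way to run the same comparison is to list only the positions where the two strings differ. This is a rough sketch: the function name `diff_code_points` is hypothetical, and it assumes the strings line up character by character.

```python
import unicodedata


def diff_code_points(got, expected):
    """List (index, got_char, expected_char) for each differing position.

    zip() stops at the shorter string, so any trailing characters in the
    longer string are ignored.
    """
    return [
        (i, g, e)
        for i, (g, e) in enumerate(zip(got, expected))
        if g != e
    ]


incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do'

for index, got, expected in diff_code_points(incorrect, correct):
    # Show the code point and official Unicode name of each character.
    print(
        index,
        'U+{:04X}'.format(ord(got)), unicodedata.name(got),
        '->',
        'U+{:04X}'.format(ord(expected)), unicodedata.name(expected),
    )
```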

Michelle Welcks
  • Really, the problem is in PyPDF. As I can not solve the problem, [I sent a message on GitHub](https://github.com/mstamy2/PyPDF2/issues/235) – macabeus Nov 13 '15 at 03:56

What I tried was replacing the specific " ' " character with "’", which solved the issue for me. Please let me know if you still fail to generate the PDF with this approach.

text = text.replace("'", "’")
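
The same replacement idea could be applied to the two characters from the question. This is a minimal sketch that assumes the only damage is the two substitutions visible there (“ in place of ç, and ‰ in place of ã). The names `REPAIRS` and `repair_text` are hypothetical, and the table would need more entries for other accented letters. It would also rewrite any genuine “ or ‰ in the text, so treat it as a workaround rather than a fix.

```python
# Hypothetical repair table built only from the two substitutions
# shown in the question; extend it as more mismatches are found.
REPAIRS = str.maketrans({
    '\u201c': '\u00e7',  # “ -> ç
    '\u2030': '\u00e3',  # ‰ -> ã
})


def repair_text(text):
    """Replace the known wrong characters with the expected ones."""
    return text.translate(REPAIRS)


print(repair_text('Resultado da Prova de Sele“‰o do...'))
```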
Tony Aziz