The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, encoding the Unicode string as UTF-8:
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
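You're on Python 3, so strings are Unicode under the hood and source files default to UTF-8. The encoding used for printed output depends on your locale, though, so I wonder if you need to specify a different encoding there.
# Show the locale aliases Python knows about (not necessarily installed on your system)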
import locale
from pprint import pprint
pprint(locale.locale_alias)
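If you want to see which encoding Python will actually use on your machine, a minimal check along these lines may help. It only uses the standard library, and what it prints depends on your system, so no output is shown here.

```python
import locale
import sys

# Encoding Python derives from the current locale settings
preferred = locale.getpreferredencoding()

# Encoding used when print() writes to stdout (may be None if stdout was replaced)
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("Preferred encoding:", preferred)
print("stdout encoding:", stdout_encoding)
```

If either of these is not UTF-8, that could explain characters being mangled on output rather than inside PyPDF.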
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you expect code point 231 (>>> hex(231) gives '0xe7', which is ç) but you're getting 8220 instead, then you're getting bad data back from PyPDF.
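To make that comparison easier to read, here is a rough sketch that lists only the non-ASCII characters of each string together with their code points and Unicode names. The helper name describe_non_ascii is made up for illustration; it relies only on the standard-library unicodedata module.

```python
import unicodedata


def describe_non_ascii(text):
    """Hypothetical helper: (char, code point, Unicode name) per non-ASCII char."""
    return [
        (char, ord(char), unicodedata.name(char, "UNKNOWN"))
        for char in text
        if ord(char) > 127
    ]


incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

for label, text in (("incorrect", incorrect), ("correct", correct)):
    print(label)
    for char, code_point, name in describe_non_ascii(text):
        # Show the character, its code point in hex and decimal, and its name
        print("  {!r:6} U+{:04X} {:>6} {}".format(char, code_point, code_point, name))
```

Running the same helper on the text you get back from PyPDF should show at a glance whether the characters are the ones you expect or something else entirely.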