I want to tokenize some texts in Portuguese. I think I'm doing almost everything right, but something is off and I can't figure out what's wrong. I'm trying this code:
import re
import nltk

text = '''Família S.A. dispõe de $12.400 milhões para concorrência. A
âncora desse négócio é conhecida no coração do Órgão responsável. '''
pattern = r'''(?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():-_`] # these are separate tokens; includes ], [
'''
print nltk.regexp_tokenize(text, pattern, flags=re.UNICODE)
And I get this result:
['Fam\xc3', 'lia', 'S.A.', 'disp\xc3\xb5e', 'de', '$12.400', 'milh\xc3\xb5es', 'para', 'concorr\xc3\xaancia', '.', 'A', '\xc3', 'ncora', 'desse', 'n\xc3', 'g\xc3\xb3cio', '\xc3', 'conhecida', 'no', 'cora\xc3', '\xc3', 'o', 'do', '\xc3', 'rg\xc3', 'o', 'respons\xc3', 'vel', '.']
It tokenizes some terms as expected, but it splits others apart: 'Família' comes out as ['Fam\xc3', 'lia'] and 'coração' as ['cora\xc3', '\xc3', 'o'].
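In case it helps narrow things down, here is a minimal sketch of what I'm seeing, assuming Python 2 (so a plain string literal holds the UTF-8 bytes rather than the characters); the splits seem to land exactly on the bytes of the accented letters:

# -*- coding: utf-8 -*-
import re

s = 'Família'           # plain str: actually the UTF-8 bytes 'Fam\xc3\xadlia'
u = s.decode('utf-8')   # unicode string: u'Fam\xedlia'

# On the byte string every byte counts as one character, so the two UTF-8
# bytes of 'í' are examined separately and the word gets split in two.
print re.findall(r'\w+', s, re.UNICODE)   # ['Fam\xc3', 'lia']

# On the decoded unicode string, 'í' is a single word character and the
# token stays whole.
print re.findall(r'\w+', u, re.UNICODE)   # [u'Fam\xedlia']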
Any help?