
I want to tokenize some texts in Portuguese. I think I'm doing almost everything right, but there's a problem I can't figure out. I'm trying this code:

    text = '''Família S.A. dispõe de $12.400 milhões para concorrência. A 
âncora desse négócio é conhecida no coração do Órgão responsável. '''
    pattern = r'''(?x)    # set flag to allow verbose regexps
         ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
       | \w+(-\w+)*        # words with optional internal hyphens
       | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
       | \.\.\.            # ellipsis
       | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
       '''

    print nltk.regexp_tokenize(text, pattern, flags=re.UNICODE)

And get this result:

['Fam\xc3', 'lia', 'S.A.', 'disp\xc3\xb5e', 'de', '$12.400', 'milh\xc3\xb5es', 'para', 'concorr\xc3\xaancia', '.', 'A', '\xc3', 'ncora', 'desse', 'n\xc3', 'g\xc3\xb3cio', '\xc3', 'conhecida', 'no', 'cora\xc3', '\xc3', 'o', 'do', '\xc3', 'rg\xc3', 'o', 'respons\xc3', 'vel', '.']

It handles some tokens as expected, but splits others, e.g. 'Família' → ['Fam\xc3', 'lia'] and 'coração' → ['cora\xc3', '\xc3', 'o'].

Any help?
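For context: the stray `'\xc3'` pieces are the lead bytes of UTF-8 sequences, which suggests the pattern is being applied to an undecoded byte string, so `\w` stops at each non-ASCII byte. A minimal sketch of the same pattern on a true Unicode string, using only the stdlib `re` module (which is what `regexp_tokenize` wraps; the groups are written as non-capturing so `re.findall` returns whole matches):

```python
import re

# Same verbose pattern as above, with non-capturing groups so that
# re.findall returns whole matches rather than group contents.
pattern = r'''(?x)
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages
    | \.\.\.                 # ellipsis
    | [][.,;"'?():-_`]       # punctuation as separate tokens
'''

# On a Unicode string (the default str in Python 3), \w matches accented
# letters, so 'Família' stays in one piece.
text = 'Família S.A. dispõe de $12.400 milhões.'
tokens = re.findall(pattern, text)
print(tokens)
# → ['Família', 'S.A.', 'dispõe', 'de', '$12.400', 'milhões', '.']
```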

Marcelo
  • Did you try `for w in text.split(): print w` ? – kums Oct 16 '14 at 22:17
  • @kums, I know split() function does the job, but it also fails when we have punctuation like ". , ; ?". Besides, I want to get this regex solution, because it seems to be, IMO, very flexible. – Marcelo Oct 17 '14 at 12:01
  • What encoding are you using? Your code works fine for me when I run it in a gui with utf-8 set as the default encoding. Your problem seems to be an encoding problem rather than a problem with your code per se. – Justin O Barber Oct 20 '14 at 01:31
  • Thanks @JustinBarber, I was using 'utf-8' as the default encoding, but your comment gave me an idea that solved the problem. – Marcelo Oct 20 '14 at 11:15
  • Glad you got it figured out! Nice work. – Justin O Barber Oct 20 '14 at 12:18

1 Answer


In case someone has the same problem I had: just change the default encoding. For Portuguese I'm using 'latin-1', and I also decode with it when printing the words in order to get the right characters. Check this out:

#!/usr/bin/env python
# -*- coding:  latin-1 -*-
""" Spliting text in portuguese (enconding 'latin-1') using regex. 
"""
import nltk
import re

print "\n****** Using Regex to tokenize ******"
text = '''Família-Empresa S.A. dispõe de $12.400 milhões para concorrência. A 
âncora, desse negócio, é conhecida no coração do Órgão responsável. '''
pattern = r'''(?x)    # set flag to allow verbose regexps
     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
    '''
result = nltk.regexp_tokenize(text, pattern, flags=re.UNICODE) 
for w in result:
    print w.decode('latin-1')

print result

The result is:

****** Using Regex to tokenize ******
Família-Empresa
S.A.
dispõe
de
$12.400
milhões
para
concorrência
.
A
âncora
,
desse
negócio
,
é
conhecida
no
coração
do
Órgão
responsável
.
['Fam\xedlia-Empresa', 'S.A.', 'disp\xf5e', 'de', '$12.400', 'milh\xf5es', 'para', 'concorr\xeancia', '.', 'A', '\xe2ncora', ',', 'desse', 'neg\xf3cio', ',', '\xe9', 'conhecida', 'no', 'cora\xe7\xe3o', 'do', '\xd3rg\xe3o', 'respons\xe1vel', '.']
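For later readers: in Python 3 this workaround isn't needed, because `str` is already Unicode, so there is no default-encoding coding cookie to set and no per-token `.decode()`. A hypothetical port of the script above using only stdlib `re` (essentially what `regexp_tokenize` does under the hood): decode the byte input once, up front, and every token comes out as a proper string.

```python
import re

# Decode the latin-1 byte input once; after that, tokens are proper str.
raw = 'Família-Empresa S.A. dispõe de $12.400 milhões.'.encode('latin-1')
text = raw.decode('latin-1')   # bytes -> str; str is Unicode in Python 3

pattern = r'''(?x)
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages
    | \.\.\.                 # ellipsis
    | [][.,;"'?():-_`]       # punctuation as separate tokens
'''
tokens = re.findall(pattern, text)
print(tokens)
# → ['Família-Empresa', 'S.A.', 'dispõe', 'de', '$12.400', 'milhões', '.']
```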

Thanks to @JustinBarber for the comment that provided the clue to solve the problem.

That's all folks!
