
I am using NLTK Collocations to find trigrams and 'training_set' is a string with many lines of text.

 import nltk
 from nltk.collocations import *

 trigram_measures = nltk.collocations.TrigramAssocMeasures()
 finder = TrigramCollocationFinder.from_words(str(training_set))
 print finder.nbest(trigram_measures.pmi, 5)

But I am getting the output as

 [('\xe5', '\x8d', '\xb8'), ('\xe5', '\x85', '\x8d'), ('\xe2', '\x80', '\x9c'), ('\xe2',    '\x80', '\x9d'), ('\xe2', '\x80', '\xa6')]

Is this some encoding problem? How do I get normal English words?

Shivendra

1 Answer


Yes, those look like 'windows-1252' encoded characters:

>>> import chardet
>>> chardet.detect('\xe5')
{'confidence': 0.5, 'encoding': 'windows-1252'}

So if you don't want those to show up you can do something like this to your text:

>>> '\xe5'.decode('windows-1252').encode('ascii', 'ignore')
''
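For reference, the round trip above can yield an empty string because `'\xe5'` decodes to `å`, which has no ASCII equivalent, so the `'ignore'` error handler drops it entirely. In Python 3 the same experiment needs a bytes literal, since `str` no longer has `.decode`:

```python
raw = b'\xe5'                                   # the stray byte from the output
decoded = raw.decode('windows-1252')            # -> 'å'
ascii_only = decoded.encode('ascii', 'ignore')  # -> b'' (å has no ASCII form)
print(decoded, ascii_only)
```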
leavesof3
  • Running the decoding and encoding script is giving an empty string. – Shivendra Sep 10 '14 at 06:11
  • Well, they won't be English words, because they're foreign characters. Just omit the encode portion to get the actual letters: >>> print '\xe5'.decode('windows-1252') å. It also looks like what you have aren't trigrams of words but of individual letters. You likely have to tokenize your text before sending it to the TrigramCollocationFinder. – leavesof3 Sep 12 '14 at 03:14
  • finder = TrigramCollocationFinder.from_words(nltk.word_tokenize(str(training_set))) – leavesof3 Sep 12 '14 at 03:35
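As the comments point out, the root cause is that `from_words` iterates whatever you hand it, and iterating a plain string yields one character (one byte, in Python 2) at a time, so the "trigrams" are really byte triples. A minimal dependency-free sketch of the difference, using `str.split` as a stand-in for `nltk.word_tokenize`:

```python
text = "the quick brown fox jumps over the lazy dog"

# What from_words(str(training_set)) actually iterates over:
char_stream = list(text)     # single characters/bytes
# What it should iterate over (stand-in for nltk.word_tokenize(text)):
word_stream = text.split()   # whole words

print(char_stream[:3])  # ['t', 'h', 'e'] -- character "trigram" material
print(word_stream[:3])  # ['the', 'quick', 'brown'] -- word trigram material
```

With the tokenized stream, `finder.nbest(trigram_measures.pmi, 5)` returns triples of words rather than triples of bytes.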