Python: Encoding characters but still work with the list

Question

For a text mining assignment, we are collecting tweets (specifically their text) and running the Stanford NER tagger on them to find out whether persons or locations are mentioned. This could also be done by checking the hashtags, but the idea is to use text mining tools.

Let's say we have data that was saved to a cPickle file, loaded again, and split on whitespace:

hil_text = [[u'Man', u'is', u'not', u'a', u'issue', u'cah', u'me', u'pum', u'pum', u'tun', u'up', u'#InternationalWomensDay', u'#cham', u'#empowerment', u'#Clinton2016', u'#PiDay2016'], [u'Shonda', u'Land', u'came', u'out', u'with', u'a', u'great', u'ad', u'for', u'Clinton:https://t.co/Vfg9lAKNaH#Clinton2016'], [u'RT', u'@BeaverforBernie:', u'Trump', u'and', u'the', u"#Clinton's", u'are', u'the', u'same.', u'They', u'worship', u'$$$$$.', u'https://t.co/yUXoJaL6mJ'], [u'.@GloriaLaRiva', u'on', u'#Clinton,', u'Reagans', u'&amp;', u'#AIDS:', u'\u201cClinton', u'just', u're-wrote', u'history\u201d', u'https://t.co/L3YuIyFjxo', u'Clinton', u'incapable', u'of', u'telling', u'truth.'], [u'#KKK', u'Leader', u'Gets', u'Behind', u'This', u'Democratic', u'Candidate', u'https://t.co/p9yTQ2sXmV', u'How', u'fitting!', u'#Hillary2016', u'#HillaryClinton', u'#Hillary', u'#Killary', u'#tcot'], [u'#KKK', u'Leader', u'Gets', u'Behind', u'This', u'Democratic', u'Candidate', u'https://t.co/p9yTQ2sXmV', u'How', u'fitting!', u'#Hillary2016', u'#HillaryClinton', u'#Hillary', u'#Killary', u'#tcot'], [u'RT', u'@jvlibrarylady:', u'President', u'Clinton', u'at', u'rally', u'for', u'Hillary', u'at', u'Teamsters', u'Local', u'245', u'in', u'Springfield,', u'Mo.', u'#HillaryClintonForPresident', u'https://t.\u2026'], [u'RT', u'@jvlibrarylady:', u'President', u'Clinton', u'at', u'rally', u'for', u'Hillary', u'at', u'Teamsters', u'Local', u'245', u'in', u'Springfield,', u'Mo.', u'#HillaryClintonForPresident', u'https://t.\u2026']]

The tagger doesn't accept unicode, so to get it to work we tried the following:

for word in hil_text:
    for x in word:
        print x.encode('utf-8',errors='ignore')
        print tagger.tag(x.encode('utf-8',errors='ignore'))

This prints x as a whole word, but the tagger tags each letter separately.

Is there a way to encode it and send it through the tagger as a word? In other words, can we encode the items of a list while still keeping them in a list?

And why does the tagger tag each letter and not just the whole x?

score 0 · Accepted Answer · answered Mar 14 '16 at 22:31

It looks like tagger.tag expects a sequence of strings, but you are passing in a single string, which Python treats as a sequence of characters. To fix that, try this:

for section in hil_text:
    # encode each word in the section, and put them in a new list
    words = [word.encode('utf-8') for word in section]
    # pass the list of encoded words to the tagger
    print tagger.tag(words)
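
To see why each letter got its own tag, here is a minimal sketch that does not need the Stanford tagger at all. fake_tag is a hypothetical stand-in (not part of NLTK or Stanford NER) that, like tagger.tag appears to do, labels every item of whatever sequence it is given. The sample words are taken from the question's data, and the second part shows one way to encode every word while keeping the nested list structure.

```python
def fake_tag(tokens):
    # Hypothetical stand-in for tagger.tag: it labels each item it iterates over.
    return [(token, 'O') for token in tokens]

section = [u'President', u'Clinton', u'at', u'rally']

# A single string is itself a sequence, so iterating yields one character at a time.
per_letter = fake_tag(section[1])

# A list of words yields whole words, which is what the tagger needs.
per_word = fake_tag(section)

# Encode every word but keep the list-of-lists shape intact.
sections = [section, [u'history\u201d']]
encoded = [[word.encode('utf-8') for word in sec] for sec in sections]

print(per_letter)
print(per_word)
print(encoded)
```

Note that under Python 3, encode returns bytes objects rather than str, so this encoding step is specific to the Python 2 setup in the question.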
