My code is based on the code at: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

My program runs fine with a small number of files, but once I get to around 1000 files I get this error:

ReadWrite.py:59: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  stopped_tokens = [i for i in tokens if not i in en_stop]

Has anyone run into this before, or does anyone have an idea how to fix it?

1 Answer

It seems like you're comparing values of different types in the list comprehension. en_stop contains unicode strings. My guess is that tokens, which you read from files, are byte strings in some encoding such as utf-8 or cp1251. Try to determine which encoding your tokens use. You can check a candidate encoding this way:

encoding = 'utf-8' # assign name like 'utf-8', 'cp1251', etc.
string = tokens[0]
try:
    string.decode(encoding)
    print 'string is {}'.format(encoding)
except UnicodeError:
    print 'string is not {}'.format(encoding)
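If you have several candidate encodings, the same check can be wrapped in a loop. This is only a rough sketch: first_matching_encoding is a made-up helper name, the candidate list is an assumption, and the sample bytes stand in for tokens[0]. It is written so that it should run on both Python 2 and Python 3. Keep in mind that a successful decode only shows the bytes are valid in that encoding, not that it is the encoding the file was actually written in.

```python
from __future__ import print_function


def first_matching_encoding(raw, candidates=('ascii', 'utf-8', 'cp1251')):
    """Return the first candidate that decodes `raw` without error, else None.

    Hypothetical helper for illustration; it is not part of any library.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
        return name
    return None


# Stand-in for tokens[0]: the UTF-8 bytes of a word with one accented letter.
sample = u'caf\u00e9'.encode('utf-8')
print(first_matching_encoding(sample))
```

Order the candidates from strictest to most permissive, because a permissive single-byte encoding will accept almost any input.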

Once you find the correct encoding, you can build stopped_tokens this way:

stopped_tokens = [i for i in tokens if unicode(i, encoding) not in en_stop]

unicode(i, encoding) decodes each token to unicode inside the list comprehension, so both sides of the comparison have the same type.
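One caveat: in Python 2, unicode(i, encoding) raises TypeError if i is already a unicode string, so it only works when every token is a byte string. If your tokens might be a mix of both, a small helper that decodes only byte strings is safer. The sketch below assumes utf-8; to_text is a made-up name, and en_stop and tokens are stand-in data, not your real lists. It should run on both Python 2 and Python 3.

```python
from __future__ import print_function


def to_text(token, encoding='utf-8'):
    """Decode byte strings; return tokens that are already text unchanged.

    Hypothetical helper for illustration. In Python 2, `bytes` is an alias
    for `str`, so this check covers ordinary byte strings there as well.
    """
    if isinstance(token, bytes):
        return token.decode(encoding)
    return token


# Stand-in data: en_stop holds unicode stop words, and tokens mixes
# byte strings with unicode strings.
en_stop = [u'the', u'a']
tokens = [b'the', u'caf\u00e9'.encode('utf-8'), u'a', u'na\u00efve']

# Decode first, then filter, so every comparison is text against text.
stopped_tokens = [t for t in (to_text(i) for i in tokens) if t not in en_stop]
print(stopped_tokens)
```

Unlike the one-liner above, this keeps the decoded tokens, so everything downstream also works with unicode.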

Eduard Ilyasov
  • I took your advice and checked that my files are utf-8. However, when I run the code changes you suggested, my error changes to UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128). Is this because I converted the files to utf-8? – Reighr Doughty Feb 15 '17 at 01:41