
I am trying to import a CSV file in order to train my classifier, but I keep receiving this error:

Traceback (most recent call last):
  File "updateClassif.py", line 17, in <module>
    myClassif = NaiveBayesClassifier(fp, format="csv")
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 191, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 123, in __init__
    self.train_set = self._read_data(train_set, format)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 143, in _read_data
    return format_class(dataset, **self.format_kwargs).to_iterable()
  File "C:\Python27\lib\site-packages\textblob\formats.py", line 68, in __init__
    self.data = [row for row in reader]
  File "C:\Python27\lib\site-packages\textblob\unicodecsv\__init__.py", line 106, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 55: ordinal not in range(128)

The CSV file contains 1,600,000 lines of tweets, so I believe some tweets contain special characters. I tried re-saving it with OpenOffice, as someone recommended, but got the same result. I also tried the latin-1 encoding, with the same result. This is my code:

with codecs.open('tr.csv', 'r', encoding='latin-1') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")

This is the code from the library I am using:

    def __init__(self, csvfile, fieldnames=None, restkey=None, restval=None,
                 dialect='excel', encoding='utf-8', errors='strict', *args,
                 **kwds):
        if fieldnames is not None:
            fieldnames = _stringify_list(fieldnames, encoding)
        csv.DictReader.__init__(self, csvfile, fieldnames, restkey, restval, dialect, *args, **kwds)
        self.reader = UnicodeReader(csvfile, dialect, encoding=encoding,
                                    errors=errors, *args, **kwds)
        if fieldnames is None and not hasattr(csv.DictReader, 'fieldnames'):
            # Python 2.5 fieldnames workaround. (http://bugs.python.org/issue3436)
            reader = UnicodeReader(csvfile, dialect, encoding=encoding, *args, **kwds)
            self.fieldnames = _stringify_list(reader.next(), reader.encoding)
        self.unicode_fieldnames = [_unicodify(f, encoding) for f in
                                   self.fieldnames]
        self.unicode_restkey = _unicodify(restkey, encoding)

    def next(self):
        row = csv.DictReader.next(self)
        result = dict((uni_key, row[str_key]) for (str_key, uni_key) in
                      izip(self.fieldnames, self.unicode_fieldnames))
        rest = row.get(self.restkey)
Pca
  • Please post the **full text** of the traceback. Also, please indicate which version of Python you are using. – MattDMo Mar 05 '16 at 19:37
  • It's likely UTF-8 encoded. Try that. – tdelaney Mar 05 '16 at 19:39
  • @tdelaney I have tried UTF-8 and it returns this: "UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 35: invalid continuation byte" – Pca Mar 05 '16 at 19:52
  • @MattDMo I have posted the full text of the traceback and I am using Python 2.7 – Pca Mar 05 '16 at 19:55
  • You should be able to figure out the line and then `print repr(line)`. Post that and maybe we can guess. Are you on Windows? Maybe it was saved in a Windows code page. I'm not sure how you got the file in the first place, but using `sys.stdout.encoding` may help. – tdelaney Mar 05 '16 at 19:57
  • Possible duplicate of [Python ASCII codec can't encode character error during write to CSV](http://stackoverflow.com/questions/32939771/python-ascii-codec-cant-encode-character-error-during-write-to-csv) – Alastair McCormack Mar 06 '16 at 18:23
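
One way to find the offending line, as suggested in the comments above, is to read the file in binary mode and report the lines that fail to decode. This is only a minimal sketch: the helper name `find_undecodable_lines` and its parameters are hypothetical and not part of textblob, and it assumes the file is expected to be UTF-8.

```python
import io


def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return up to `limit` (line_number, raw_bytes) pairs that fail to decode.

    Hypothetical helper for illustration; not part of textblob.
    """
    bad = []
    # Binary mode, so no decoding happens until we ask for it explicitly.
    with io.open(path, "rb") as fp:
        for lineno, raw in enumerate(fp, 1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((lineno, raw))
                if len(bad) >= limit:
                    break
    return bad


# Example usage (assumes tr.csv is in the current directory):
# for lineno, raw in find_undecodable_lines("tr.csv"):
#     print(lineno, repr(raw))
```

The `repr` of the raw bytes shows which byte values are involved (for example `\xe6`), which helps when guessing the file's real encoding.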

2 Answers


Note that the traceback reports a UnicodeEncodeError, not a UnicodeDecodeError. It looks like NaiveBayesClassifier is expecting ASCII. Either make it accept Unicode or, if that is acceptable for your application, replace non-ASCII characters with '?' or a similar placeholder.
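
If losing the special characters is acceptable, the replacement can be done with the `replace` error handler of `encode`. A minimal sketch, assuming the text is already a unicode string; the function name `ascii_only` is made up for illustration:

```python
def ascii_only(text):
    """Return `text` with every non-ASCII character replaced by '?'.

    Assumes `text` is a unicode string (u'...' in Python 2), not raw bytes.
    """
    # Encoding with errors="replace" substitutes '?' for anything outside ASCII.
    return text.encode("ascii", "replace").decode("ascii")


cleaned = ascii_only(u"bl\xe6b\xe6r")
```

Note that this discards information, so it only makes sense when the classifier does not need the original characters.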

  • I have attached above the code from the library I am using. Should I change the `encoding='utf-8'` default? – Pca Mar 05 '16 at 20:28
  • How do you initialize your library? It accepts an `encoding` argument. Did you try setting it to latin1? – Jon Kåre Hellan Mar 05 '16 at 20:46
  • I am using a classifier provided by this library and am just importing it: `from textblob.classifiers import NaiveBayesClassifier` and `from textblob import TextBlob`. I have now tried latin1 and get the same error. Is there an encoding that contains all characters, including special ones? – Pca Mar 05 '16 at 20:53
  • Hmm. Can you convert the input data to UTF-8 once and for all, or does it keep changing? – Jon Kåre Hellan Mar 05 '16 at 21:25

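In Python 2, the csv module does not support unicode input. `codecs.open` yields unicode strings, which then get implicitly encoded as ASCII, hence the UnicodeEncodeError. So you must pass in some kind of iterator object (such as a file opened in binary mode) which only produces byte strings.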

This means that your code should look like this:

with open('tr.csv', 'rb') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")

But note that the CSV file must be encoded as UTF-8. If it is not, you will need to convert it to UTF-8 first for the code above to work.
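
The conversion could be sketched as follows. This assumes the file is currently latin-1 encoded, which is only a guess based on the `0xe6` byte in the error message; check the real source encoding first. The function name `convert_to_utf8` and the output file name are hypothetical.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="latin-1",
                    chunk_size=1 << 20):
    """Re-encode a text file to UTF-8, reading in chunks to bound memory use.

    `src_encoding` is an assumption; pass the file's actual encoding.
    """
    # newline="" disables newline translation, so line endings are preserved.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)


# Example usage (file names are placeholders):
# convert_to_utf8("tr.csv", "tr_utf8.csv")
```

The converted file can then be opened with `open('tr_utf8.csv', 'rb')` and passed to the classifier as shown above.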

ekhumoro