I'm trying to read a CSV file containing non-ASCII characters (French accents at the moment, but Russian as well in the future). Is there a way to read these CSVs without removing or replacing those characters?

Whenever I try `pd.read_csv('filename.csv', encoding='utf-8')`, it fails to find any columns.

So I tried this:

import codecs

with codecs.open('filename.csv', 'r') as f:
    for line in f.readlines():
        print line

It just outputs `[Decode error - output not utf-8]` for some lines (the ones with áéí etc.).

I have also tried the suggestion below to detect the file's encoding, and when I read the file with the detected encoding, I get `UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)` or something similar.

Any ideas? Thanks in advance

Lucidnonsense
  • Probably your data simply isn't UTF-8. – Sven Marnach Sep 03 '14 at 13:27
  • I'd first check the encoding of your csv file using the file command. `file filename.csv` would output something like this: filename.csv: ASCII text, with CRLF line terminators. – Adam Papai Sep 03 '14 at 13:30
  • Check the [BOM](http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding) if there is one – EdChum Sep 03 '14 at 13:31
  • What is this file command? Is it a python statement or in command prompt? How do I see the BOM? – Lucidnonsense Sep 03 '14 at 13:40
  • Open the file in a hex editor or read the file in byte mode, check whether the values in the beginning of the file match the BOM link I commented. – EdChum Sep 03 '14 at 13:43
  • You can also check within python for utf-8 using this related question: http://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python – EdChum Sep 03 '14 at 13:52
  • OK, I've tried the above suggestion and it reports ASCII for some files and UTF-16LE for others! But when I try to read a file using the reported encoding, I still get `UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)` or something similar! – Lucidnonsense Sep 04 '14 at 08:34
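
The BOM check suggested in the comments above can be sketched roughly as follows. This is a minimal illustration, not a confirmed fix for the asker's files: `encoding_from_bom` and `sniff_file_encoding` are hypothetical helper names, and the `utf-8` fallback for files without a BOM is an assumption (such files could just as well be Latin-1 or cp1252). The sketch returns the generic `utf-16`, `utf-32`, and `utf-8-sig` codecs because those consume the BOM while decoding. The endian-specific `utf-16-le` codec leaves `u'\ufeff'` as the first character, which is one plausible source of the `UnicodeEncodeError` quoted above once that text is printed to an ASCII console.

```python
import codecs
import io

# BOM signatures mapped to codecs that strip the BOM while decoding.
# UTF-32 is listed first because its little-endian BOM begins with the
# same two bytes as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head, default="utf-8"):
    """Guess a codec name from the first bytes of a file (hypothetical helper).

    `default` is returned when no BOM is found; it is an assumption,
    not something the bytes can confirm.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default


def sniff_file_encoding(path, default="utf-8"):
    """Open `path` in byte mode and inspect its first four bytes."""
    with io.open(path, "rb") as f:
        return encoding_from_bom(f.read(4), default)
```

The result could then be passed on, for example `pd.read_csv('filename.csv', encoding=sniff_file_encoding('filename.csv'))`. UTF-16 files exported from spreadsheet software are often tab-delimited, so `sep='\t'` may also be needed if pandas still finds only one column.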

0 Answers