Finding the encoding when opening a CSV file in Python

Question

I have trouble understanding how to detect the correct encoding of a CSV file. I created a small CSV file for testing by copying and pasting some rows from one of the original files I want to process, and saved it from my local Excel as CSV. My program handles this and similar files without problems, but when I try to open a file sent to me from another computer, the program exits with an error.

The section of the code that opens the file:

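import csv

numeros = []  # phone numbers collected from the file
# file_path holds the path of the CSV file being processed
with open(file_path, 'r') as f:
    # Let the sniffer guess the delimiter and quoting from the first 1024 characters
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    reader = csv.DictReader(f, fieldnames=['RUT', 'Nombre', 'Telefono'], dialect=dialect)
    for row in reader:
        numeros.append(row['Telefono'])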

The error:

Traceback (most recent call last):
  File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 22, in <module>
    for row in reader:
  File "C:\Program Files\Python35\lib\csv.py", line 110, in __next__
    row = next(self.reader)
  File "C:\Program Files\Python35\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6392: character maps to <undefined>

Process finished with exit code 1

My locale.getpreferredencoding() is 'cp1252'.

I made a couple of attempts to guess the encoding:

with open(file_path,'r', encoding='cp1252') as f:

This works with my locally generated CSV, but not with the ones I'm sent.

with open(file_path,'r', encoding='utf-8') as f:

This doesn't work with any file, but it produces a different error:

Traceback (most recent call last):
  File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 19, in <module>
    dialect = csv.Sniffer().sniff(f.read(1024))
  File "C:\Program Files\Python35\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1670: invalid continuation byte

Process finished with exit code 1
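Both error messages can be reproduced on isolated bytes, which helps in reading them. The sketch below is only an illustration: the sample value b'MU\xd1OZ' is made up and is not taken from the files in question. It shows that byte 0x9d has no character assigned in Python's cp1252 codec, and that byte 0xd1 (the letter Ñ in cp1252) is not valid UTF-8 when an ordinary ASCII byte follows it.

```python
# Byte 0x9d is one of the few positions left undefined in Python's cp1252 codec.
try:
    b'\x9d'.decode('cp1252')
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Byte 0xd1 starts a two-byte sequence in UTF-8, so it is rejected when the
# next byte is plain ASCII. The sample bytes are hypothetical.
sample = b'MU\xd1OZ'
try:
    sample.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(cp1252_error)
print(utf8_error)

# The same bytes decode cleanly as cp1252, where 0xd1 is an uppercase N with tilde.
as_cp1252 = sample.decode('cp1252')
print(ascii(as_cp1252))
```

Seen this way, the UTF-8 error suggests the files are in a single-byte "ANSI" encoding rather than UTF-8, while the cp1252 error suggests that at least one byte in the failing files falls outside what cp1252 defines.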

I also tried adding newline='' to the open() call, but it makes no difference.

Following an answer from Stack Overflow, I opened the files with Notepad and checked the encoding under 'Save as'. Both my local files and the ones I receive by email show 'ANSI' as the encoding.

Do I need to figure out the encoding myself, or can Python do that for me? Is there something wrong in my code?

I'm using Python 3.5, and the files were most likely created on computers with a Spanish-language OS.

Update: I have been doing some more testing. Almost all CSV files open without problems and the program runs correctly, but two files cause an error when I try to open them. In Excel or Notepad these files look normal. I suspect that they were created or saved on a computer with an uncommon OS or language.

May I see the output of : `offset = 6392` `print(ascii(open('the_file', 'rb').read()[offset-8:offset+8]))` — stovfl, Apr 18 '17 at 09:49
@stovfl **b'BECERRA;96599251'** The ';' is the field divider on this csv, and on the other files that work too. — Pablo, Apr 18 '17 at 13:26
Can you paste one of these troublesome files somewhere that would be accessible to us for our examination, and then let me know with a comment? — Bill Bell, Apr 18 '17 at 14:08
Have you double-checked that this is the same file that gives _"UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6392"_? Where is the byte 0x9d? — stovfl, Apr 18 '17 at 14:54
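The check requested in the comments can be generalised by searching the raw bytes for every occurrence of the offending value instead of looking at a single offset. This matters because the position in the error message may be relative to the chunk being decoded rather than to the start of the file, which would explain why offset 6392 shows no 0x9d. A minimal sketch, assuming a hypothetical helper find_byte and an invented second sample row:

```python
def find_byte(data, value, context=8):
    """Yield (offset, surrounding bytes) for each occurrence of `value` in `data`."""
    needle = bytes([value])
    start = 0
    while True:
        offset = data.find(needle, start)
        if offset == -1:
            return
        # Keep a few bytes on each side so the affected field can be recognised.
        yield offset, data[max(0, offset - context):offset + context]
        start = offset + 1


# Invented sample; for a real file use: data = open(file_path, 'rb').read()
sample = b'BECERRA;96599251\nMU\x9dOZ;12345678\n'
for offset, around in find_byte(sample, 0x9d):
    print(offset, ascii(around))
```

Running this against one of the two failing files would show whether 0x9d is present at all and which field it sits in.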
