Writing and then reading a string in file encoded in latin1

Question

Here are two Python 3 code samples. The first one writes two files with Latin-1 encoding:

s='On écrit ça dans un fichier.'
with open('spam1.txt', 'w',encoding='ISO-8859-1') as f:
    print(s, file=f)
with open('spam2.txt', 'w',encoding='ISO-8859-1') as f:
    f.write(s)

The second one reads the same files with the same encoding:

with open('spam1.txt', 'r',encoding='ISO-8859-1') as f:
    s1=f.read()
with open('spam2.txt', 'r',encoding='ISO-8859-1') as f:
    s2=f.read()

Now, when I print s1 and s2, I get

On Ã©crit Ã§a dans un fichier.

instead of the initial "On écrit ça dans un fichier."

What is wrong? I also tried io.open, but I am missing something. The funny part is that I had no such problem with Python 2.7 and its str.decode method, which is now gone...

Could someone help me?

Are you 100% certain the files were written with Latin-1 encoding? That looks awfully much like UTF-8 data.. — Martijn Pieters, Jul 22 '13 at 14:38
`>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')` gives `'On Ã©crit Ã§a dans un fichier.'` — Martijn Pieters, Jul 22 '13 at 14:39

Martijn Pieters · Accepted Answer · 2013-07-22T18:48:26.450

8

Your data was written out as UTF-8:

>>> 'On écrit ça dans un fichier.'.encode('utf8').decode('latin1')
'On Ã©crit Ã§a dans un fichier.'

This either means you did not write out Latin-1 data, or your source code was saved as UTF-8 but you declared your script (using a PEP 263-compliant header) to be Latin-1 instead.
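As a minimal sketch of why this pattern points at UTF-8, the snippet below garbles the text the same way and then reverses it. The names `original`, `garbled` and `repaired` are illustrative only, and the literal uses escapes so that the result does not depend on how this snippet itself is saved:

```python
original = 'On \u00e9crit \u00e7a dans un fichier.'

# UTF-8 bytes misread as Latin-1: each accented letter turns into two characters.
garbled = original.encode('utf8').decode('latin1')
print(garbled.encode('unicode_escape'))

# Latin-1 maps every byte 0-255 to exactly one code point, so the
# misreading can be undone by reversing the two steps.
repaired = garbled.encode('latin1').decode('utf8')
print(repaired == original)
```

The reversal only works while the garbled text has not been modified further; it is a diagnostic aid, not a substitute for fixing the source encoding.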

If you saved your Python script with a header like:

# -*- coding: latin-1 -*-

but your text editor saved the file with UTF-8 encoding instead, then the string literal:

s='On écrit ça dans un fichier.'

will be misinterpreted by Python as well, in the same manner. Saving the resulting unicode value to disk as Latin-1, then reading it again as Latin-1 will preserve the error.
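The following sketch simulates that situation without relying on any editor, under the assumption that the source bytes are UTF-8 while Python decodes them as Latin-1. The temporary directory and the name `source_bytes` are stand-ins introduced for illustration:

```python
import os
import tempfile

# What the editor actually wrote to disk: the literal encoded as UTF-8.
source_bytes = 'On \u00e9crit \u00e7a dans un fichier.'.encode('utf-8')

# What Python ends up with when the header claims Latin-1: a garbled str.
s = source_bytes.decode('latin-1')

with tempfile.TemporaryDirectory() as tmp:
    # Scratch file standing in for spam2.txt from the question.
    path = os.path.join(tmp, 'spam2.txt')
    with open(path, 'w', encoding='ISO-8859-1') as f:
        f.write(s)
    with open(path, 'r', encoding='ISO-8859-1') as f:
        s2 = f.read()
    with open(path, 'rb') as f:
        raw = f.read()

# The Latin-1 round trip is lossless, so the garbled text comes back unchanged.
print(s2 == s)
# The file ends up holding the UTF-8 byte pairs, which is why it looks like UTF-8 data.
print(raw == source_bytes)
```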

To debug, take a close look at the output of print(s.encode('unicode_escape')) in the first script. If it looks like:

b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'

then the actual encoding of your source file and the PEP 263 header disagree on how the source code should be interpreted. If your source code is decoded correctly, the output is:

b'On \\xe9crit \\xe7a dans un fichier.'
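The same check can be automated with a small heuristic. This is only a sketch: `looks_double_decoded` is a hypothetical helper, not part of any library, and it can only flag text that happens to form valid UTF-8 when re-encoded as Latin-1:

```python
def looks_double_decoded(text):
    """Return True if text looks like UTF-8 data that was decoded as Latin-1."""
    try:
        candidate = text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8: not this kind of damage.
        return False
    # Pure ASCII text survives unchanged, so it is not reported.
    return candidate != text


good = 'On \u00e9crit \u00e7a dans un fichier.'
bad = good.encode('utf-8').decode('latin-1')

print(good.encode('unicode_escape'))  # b'On \\xe9crit \\xe7a dans un fichier.'
print(bad.encode('unicode_escape'))   # b'On \\xc3\\xa9crit \\xc3\\xa7a dans un fichier.'
print(looks_double_decoded(good), looks_double_decoded(bad))
```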

If Spyder is stubbornly ignoring the PEP-263 header and reading your source as Latin-1 regardless, avoid using non-ASCII characters and use escape codes instead; either using \uxxxx unicode code points:

s = 'On \u00e9crit \u00e7a dans un fichier.'

or \xaa one-byte escape codes for code-points below 256:

s = 'On \xe9crit \xe7a dans un fichier.'
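A quick sanity check, as a sketch with illustrative names `s_u` and `s_x`, shows that both escape forms produce the same string and that it encodes to single Latin-1 bytes. Because the source then contains only ASCII, the result does not depend on how the editor saves the file:

```python
s_u = 'On \u00e9crit \u00e7a dans un fichier.'
s_x = 'On \xe9crit \xe7a dans un fichier.'

# Both literals denote the same code points (U+00E9 and U+00E7).
print(s_u == s_x)

# Encoded as Latin-1, each accented letter is exactly one byte.
print(s_u.encode('latin-1'))
```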

edited Jul 22 '13 at 18:48

answered Jul 22 '13 at 14:41

Martijn Pieters


@Coulombeau: without some landmarks, I cannot help you find your way. I gave you an indication on how to debug this. How about you update your question with the output of `s.encode('unicode_escape')` and poke me again? – Martijn Pieters Jul 22 '13 at 16:46
Well, I'm lost! I edited only the first script as you requested. I ran the test `print(s.encode('unicode_escape'))`, which gave me the first buggy result you cited. Then I decided to add a header (which I hadn't done before) and put `# -*- coding: utf-8 -*-`, and also tried ascii and latin1. Nothing changed. Then I wrote these simple lines (I need to understand, so let's keep it very simple!): `# -*- coding: utf-8 -*- s='On écrit ça dans un fichier.' print(s.encode('utf-8').decode('utf-8'))` which gave me... On Ã©crit Ã§a dans un fichier. – François Coulombeau Jul 22 '13 at 16:49
Sorry for the first useless comment; I published it by mistake and was then unable to edit it because 5 minutes had passed. – François Coulombeau Jul 22 '13 at 16:51
Your editor is then saving your source code as Latin1 instead. – Martijn Pieters Jul 22 '13 at 16:54
Well, could it be a problem with my distribution? I'm using Windows 7 with WinPython 3.3 (64-bit). I don't really have a choice of distribution, as I'm a teacher and that's what is installed on the computers of the CPGE (French system...) where I teach. Anyway, I'm editing with Spyder right now. Should I switch editors or edit some configuration file? The bad result for `print(s.encode('utf-8').decode('utf-8'))` is really incomprehensible to me... – François Coulombeau Jul 22 '13 at 17:08
That's because `s` *itself* is already incorrect. This is not a problem with your distribution but with how your source code is saved. – Martijn Pieters Jul 22 '13 at 17:10
The first hex bytes of my file are 23 20 2D..., which correspond to `# -`. So the encoding of the file should be ANSI or CP1252, but with the header `# -*- coding: cp1252 -*-` nothing changes... I should add that the string is displayed correctly in the editor, and in the console by a `print(s)`. – François Coulombeau Jul 22 '13 at 17:32
@Coulombeau: And what does `print(s.encode('unicode_escape'))` tell you about the value? – Martijn Pieters Jul 22 '13 at 18:28
OK, I think it's a bug in Spyder, or at least I have a hint. I executed the very simple code with `# -*- coding: utf-8 -*-` as the header (because my hex editor gave me c3 c9 for é, even though the leading hex bytes for UTF-8 were missing), then defined `s='On écrit ça dans la console.'` on the second line and `print(s)` on the third. With Spyder, I get `On Ã©crit Ã§a dans la console.` in the console window, even though the string is fine in the editor window. Then I loaded and ran the same file (without editing it) under IDLE, and the result was correct! – François Coulombeau Jul 22 '13 at 18:41
@Coulombeau: Interesting! Seems Spyder is ignoring the PEP 263 header. You can use `\uxxxx` escapes instead when creating a literal. – Martijn Pieters Jul 22 '13 at 18:45
The `print(s.encode('unicode_escape'))` output is correct when the script is run from IDLE, and wrong when it is run from Spyder. – François Coulombeau Jul 22 '13 at 18:46
Thanks a lot for your help; I wouldn't have thought about the encoding of my script.py if you hadn't pointed it out! – François Coulombeau Jul 22 '13 at 18:48
