Python unicode string literals in module declared as utf-8

Question

I have a dummy Python module with the utf-8 header that looks like this:

# -*- coding: utf-8 -*-
a = "á"
print type(a), a

Which prints:

<type 'str'> á

But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?

To get a as a unicode string, I use:

a = u"á"

But this doesn't seem very "polite", nor practical. Is there a better option?

@MarkRansom I can't change the Python version because of compatibility issues — Caumons, Nov 04 '13 at 15:26
What is not 'polite' about using a `u'..'` unicode literal? *Why* do you feel it is impractical? — Martijn Pieters, Nov 04 '13 at 15:28
Because it's really easy to forget using it, and you have to add the `u` char before ALL strings. The desired behaviour is the one with Python 3 — Caumons, Nov 04 '13 at 15:29
@Caumons: Then use Python 3, where you'll run into the same issues when you need to use `b'...'` to declare a byte string whenever you *don't* need a unicode value. — Martijn Pieters, Nov 04 '13 at 15:30
OK, so the conclusion is that when you need unicode literals, you have to "produce" them, or declare them with `u`. The encoding header applies to how these unicode strings are actually encoded but doesn't affect `str`. — Caumons, Nov 04 '13 at 15:32
`from __future__ import unicode_literals` will turn all literals into Unicode literals (Python 2.6 and 2.7). — Sven Marnach, Nov 04 '13 at 15:32
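
As a rough sketch of the `unicode_literals` approach mentioned in the last comment: the variable names here are only illustrative, and the effect described assumes Python 2.6 or 2.7. On Python 3 the import is accepted but changes nothing, since unprefixed literals are already text there.

```python
# -*- coding: utf-8 -*-
# The __future__ import has to come before any other statement in the module.
from __future__ import unicode_literals

a = "á"            # unprefixed literal: unicode on 2.6/2.7 thanks to the import
raw = b"\xc3\xa1"  # byte strings now need an explicit b prefix

text_type = type(u"")              # unicode on Python 2, str on Python 3
print(isinstance(a, text_type))    # True
print(isinstance(raw, text_type))  # False
```

The trade-off is that every literal in the module changes type at once, so any code that relied on unprefixed literals being byte strings needs a `b` prefix added.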

score 6 · Answer 1 · 2013-11-04T15:40:23.983

# -*- coding: utf-8 -*-

doesn't make the string literals Unicode. Take this example: I have a UTF-8 encoded file with an Arabic comment and string:

# هذا تعليق عربي
print type('نص عربي')

If I run it, it throws a SyntaxError:

SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

To allow this, I have to add the line that tells the interpreter the file is UTF-8 encoded:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type('نص عربي')

Now it runs fine, but it still prints <type 'str'> unless I make the string a Unicode literal:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type(u'نص عربي')

edited Nov 04 '13 at 15:40

answered Nov 04 '13 at 15:35

Even though declaring the file encoding makes the non-ASCII `str` print OK, processing these kinds of strings could cause a runtime error, couldn't it? So the moral is: when using non-ASCII characters, apart from adding the header, always use `u` on the potentially problematic strings. Am I right? – Caumons Nov 04 '13 at 15:40
@Caumons It isn't printed or handled correctly if it's not unicode: one character is counted as two separate characters, e.g. `print len(u'أ'); print len('أ')`. – Nov 04 '13 at 15:44
OK, so using `u` from now :) Thanks! – Caumons Nov 04 '13 at 15:49
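
To make the length difference from the comments concrete, here is a small sketch. It spells the character as an escape and encodes it explicitly instead of relying on a bare `'أ'` literal; both are choices made for this illustration so that it behaves the same on Python 2 and 3. The names `text` and `data` are illustrative.

```python
# -*- coding: utf-8 -*-
text = u"\u0623"             # ARABIC LETTER ALEF WITH HAMZA ABOVE: one code point
data = text.encode("utf-8")  # the bytes a Python 2 str literal holds in a UTF-8 file

print(len(text))   # 1
print(len(data))   # 2
print(repr(data))  # shows the two bytes \xd8\xa3
```

Slicing or indexing `data` works on single bytes, which is why byte-string handling can split a character in half.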

Martijn Pieters · Accepted Answer · 2013-11-04T15:41:23.927

No, the codec declaration at the top only tells Python how to interpret the source code, and Python uses that codec to decode Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:

This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.

Emphasis mine.

Without the codec declaration, Python has no idea how to interpret non-ASCII characters:

$ cat /tmp/test.py 
example = '☃'
$ python2.7 /tmp/test.py 
  File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If Python behaved the way you expect it to, you would not be able to express literal bytestring values that contain non-ASCII byte values either.

If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
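
A minimal sketch of why matching encodings matter, assuming a consumer that decodes the same bytes once as UTF-8 and once as Latin-1 (the second codec is picked purely for illustration; a real terminal could be configured with any encoding):

```python
# -*- coding: utf-8 -*-
snowman_bytes = b"\xe2\x98\x83"  # UTF-8 encoding of U+2603 SNOWMAN

as_utf8 = snowman_bytes.decode("utf-8")      # consumer agrees with the producer
as_latin1 = snowman_bytes.decode("latin-1")  # consumer assumes another encoding

print(as_utf8 == u"\u2603")  # True
print(len(as_utf8))          # 1
print(len(as_latin1))        # 3
```

The byte string is identical in both cases; only the interpretation differs, and the Latin-1 reading yields three unrelated characters instead of one snowman.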

The correct way to get unicode values is to use unicode literals or to otherwise produce unicode (by decoding byte strings, converting integer codepoints to unicode characters, etc.):

unicode_snowman = '\xe2\x98\x83'.decode('utf8')
unicode_snowman = unichr(0x2603)

In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.

Having a non-ASCII character in a byte string results in a syntax error if you don't specify a file encoding, although the encoding doesn't influence the resulting string in any way. — Sven Marnach, Nov 04 '13 at 15:31
@SvenMarnach thanks so much for this comment!!! This fully clarifies my doubt! :) I'll accept this answer — Caumons, Nov 04 '13 at 15:35
@SvenMarnach: right, even though the bytestring value itself won't change based on the codec set, python is conservative here and won't try and parse the Python source code without a codec set. — Martijn Pieters, Nov 04 '13 at 15:37

score 2 · Answer 3 · edited Jun 20 '20 at 09:12

No, this is just the source code encoding. Please see http://www.python.org/dev/peps/pep-0263/

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
      # coding=<encoding name>

or (using formats recognized by popular editors)

      #!/usr/bin/python
      # -*- coding: <encoding name> -*-

or

      #!/usr/bin/python
      # vim: set fileencoding=<encoding name> :

This doesn't make all literals unicode; it only specifies how unicode literals should be decoded.

Use the unicode function or the u prefix to make a literal unicode.

N.B. In Python 3, all strings are unicode.
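
As a hedged illustration of how the three literal forms compare across versions (the names `plain`, `text`, `data` and `is_py3` are made up for this sketch, which is written to run on both Python 2.6+ and Python 3.3+):

```python
# -*- coding: utf-8 -*-
import sys

plain = "abc"  # no prefix: byte string on Python 2, text on Python 3
text = u"abc"  # u prefix: text on both (Python 3 accepts it again from 3.3)
data = b"abc"  # b prefix: byte string on both (available from Python 2.6)

text_type = type(text)
is_py3 = sys.version_info[0] >= 3

print(isinstance(plain, text_type) == is_py3)  # True
print(isinstance(data, text_type))             # False
```

Only the unprefixed literal changes meaning between versions, so prefixing explicitly keeps the intent visible on either interpreter.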
