Specifying unicode literal's encoding on a per-literal basis

Question

According to the documentation, it is possible to define the encoding of the literals used in a Python source file like this:

# -*- coding: latin-1 -*-

u = u'abcdé'  # This is a unicode string encoded in latin-1

Is there any syntax support to specify the encoding on a per-literal basis? I am looking for something like:

latin1 = u('latin-1')'abcdé'  # This is a unicode string encoded in latin-1
utf8   = u('utf-8')'xxxxx'    # This is a unicode string encoded in utf-8

I know that syntax does not make sense, but I am looking for something similar. What can I do? Or is it maybe not possible to have a single source file with unicode strings in different encodings?

Martijn Pieters · Accepted Answer · 2013-10-17T07:22:13.323


No, there is no way to mark a unicode literal as using a different encoding from the rest of the source file.

Instead, you'd manually decode the literal from a bytestring:

latin1 = 'abcdé'.decode('latin1')  # provided `é` is stored in the source as an E9 byte.

or using escape sequences:

latin1 = 'abcd\xe9'.decode('latin1')

The whole point of the source-code codec line is to support using an arbitrary codec in your editor. Source code should never use mixed encodings, really.
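If the goal is to hold text that originated in several encodings within one file, a minimal sketch along the lines of the escape-sequence approach above might look like this. The variable names and the explicit `b''` prefix are my own additions; the prefix is optional on Python 2 but keeps the snippet valid on Python 3 as well. Because the non-ASCII bytes are spelled out as escapes, the result does not depend on how the editor saved the file.

```python
# -*- coding: utf-8 -*-
# Byte strings written with escapes are independent of the source encoding.
latin1_bytes = b'abcd\xe9'      # 'é' in Latin-1 is the single byte E9
utf8_bytes = b'abcd\xc3\xa9'    # 'é' in UTF-8 is the two bytes C3 A9

# Decode each byte string with the codec it was written in.
latin1_text = latin1_bytes.decode('latin-1')
utf8_text = utf8_bytes.decode('utf-8')

# Once decoded, both are the same unicode text; the original codec no longer matters.
same_text = (latin1_text == utf8_text)
```

After decoding, `latin1_text` and `utf8_text` compare equal, even though the byte strings they came from have different lengths.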

edited Oct 17 '13 at 07:22

answered Oct 17 '13 at 07:14


Thanks. What is `'abcdé'` in your example? An 8-bit string I assume. How do I know how my editor is storing that `é`? I can do a hex dump, of course, but what I am really asking is: how do I *control* how my editor is storing that `é`? (I am using emacs) – blueFast Oct 17 '13 at 07:28
@gonvaled: Generally, you don't. I was kinda wondering why you are asking this; it is *rare* for a source file to mix codecs, and usually a freak accident that this happened. I don't use Emacs, but in *general* I know of no editor that'll let you save a text file with a mixed codec. Modern editors present you with Unicode text handling and will encode and decode for you as needed. – Martijn Pieters Oct 17 '13 at 07:32
@gonvaled: Note that Python does the same thing; the literal is decoded into a unicode object, after which the original source codec no longer matters. – Martijn Pieters Oct 17 '13 at 07:32
@gonvaled: and your source code is *always* 8-bit values. That is the point of the codec at the top. Just because we mostly restrict ourselves to the 7 bits of ASCII doesn't mean the file itself doesn't have those 8th bits in all the saved bytes. – Martijn Pieters Oct 17 '13 at 07:34
So, so, so ... My source file is 8-bit, encoded in latin-1 (say), and python decodes the literals to unicode using the coding provided at the top. And my editor works with unicode strings, encoding the file when saved in (hopefully) the same encoding that I am telling python. I think that is my problem: how do I know that my editor is using the encoding that I am telling python? Does this have something to do with the locale configuration? What if I edit the same file on different computers, with different locales? Will things get mixed? How do I verify in what encoding a file is saved? – blueFast Oct 17 '13 at 07:44
And just to clarify: the reason why I wanted to use multiple encodings in a file is to code some unicode testcases, showcasing different encoding and decoding operations. Just to get a better grasp of the whole thing, because I am continuously getting into trouble with different aspects of my web application (input in forms, "to" addresses in emails, etc.) – blueFast Oct 17 '13 at 07:51
That's a question of configuring your editor properly. But note that Emacs can use the same system and configure the encoding from a comment; Python's codec header is specifically designed to use the same format! – Martijn Pieters Oct 17 '13 at 08:07
I see. And, for users of emacs, I just discovered `describe-char` and `describe-coding-system`, which gives a lot of information regarding use of coding system. Thanks for following through! – blueFast Oct 17 '13 at 08:17
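On the question of verifying how a file was saved, one rough check (a sketch only, with a hypothetical helper name `decodes_as`) is to look at the raw bytes and see whether they decode under the codec you expect. Here the "saved files" are simulated with in-memory byte strings rather than read from disk.

```python
# -*- coding: utf-8 -*-
def decodes_as(data, encoding):
    """Return True if the raw bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Simulate the same source line saved by an editor in two different encodings.
line = u'u = "abcd\xe9"\n'
latin1_saved = line.encode('latin-1')   # contains a lone E9 byte
utf8_saved = line.encode('utf-8')       # contains the pair C3 A9

latin1_is_utf8 = decodes_as(latin1_saved, 'utf-8')   # False: a lone E9 is not valid UTF-8
utf8_is_utf8 = decodes_as(utf8_saved, 'utf-8')       # True
utf8_is_latin1 = decodes_as(utf8_saved, 'latin-1')   # True, but the text would be garbled
```

This only works in one direction: latin-1 assigns a character to every byte value, so any file "decodes" as latin-1 without error. A failed UTF-8 decode is informative; a successful latin-1 decode is not.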
