3

I want to do this:

Take the bytes of this utf-8 string:

访视频

Encode those bytes in latin-1 and print the result:

è®¿è§†é¢‘

How do I do this in Python?

# -*- coding: utf-8
s = u'访视频'.encode('latin-1')

Causes this exception:

s = u'访视频'.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
OregonTrail

2 Answers

7

What you're asking to do is literally impossible. You can't encode those characters to Latin-1, because those characters don't exist in Latin-1.

To get the output you want, you want to decode the UTF-8 bytes as if they were Latin-1. Like this:

s = u'访视频'.encode('utf-8').decode('latin-1')

However, your desired output doesn't look like actual Latin-1, because in Latin-1, characters \x86 and \x91 are non-printable, so you're going to get this:

è®¿è§ é¢

(Notice the gap in the middle in place of †, and the missing ‘ at the end; those are actually invisible control characters, not spaces.)

It looks like you want a Latin-1 superset, probably Windows codepage 1252. In which case what you really want is:

s = u'访视频'.encode('utf-8').decode('cp1252')
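
To make the difference between the two decodings concrete, here is a minimal sketch, assuming Python 3 (where the `u''` prefix is accepted but optional). The variable names are only illustrative. It prints the results through `ascii()` so the invisible control characters show up as escapes:

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3; variable names are illustrative, not from the question.
text = u'访视频'
utf8_bytes = text.encode('utf-8')

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this decode never fails, but 0x86 and 0x91 become C1 control characters.
as_latin1 = utf8_bytes.decode('latin-1')

# cp1252 maps 0x86 to U+2020 (dagger) and 0x91 to U+2018 (left single quote).
as_cp1252 = utf8_bytes.decode('cp1252')

# ascii() escapes everything non-ASCII, making the difference visible.
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

One caveat: unlike Latin-1, Python's cp1252 codec leaves five bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so decoding arbitrary UTF-8 bytes as cp1252 can raise `UnicodeDecodeError`. None of those bytes occur in this particular string.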
abarnert
  • Hmm, I'm on a Mac and I retrieved that string using an ISO-8559-1 encoding, but your second example is exactly what I wanted. Thanks! – OregonTrail Nov 14 '14 at 20:33
  • @OregonTrail: A lot of websites, text files, etc. claim they're in ISO-8859-1 (not 8559, but I'm sure that was a meaningless typo) when they're actually in some extended version because the author doesn't know the difference. Especially Windows users, who think that their own ANSI code page (usually cp1252) is Latin-1. (You also used to occasionally see the codepage Windows uses for remapping MacRoman to be sort of Latin-1-ish, which I forget the number of, but that was a long time ago.) – abarnert Nov 14 '14 at 20:48
  • Any idea how to do the opposite? Take "è®¿è§†é¢‘" and get back "访视频". I can't seem to get it to work. – OregonTrail Nov 14 '14 at 21:07
  • Ok, so `s.encode('latin-1').decode('utf-8')` obviously works in this example, but I'm running into a bigger problem in my actual codebase that I can't pin down. – OregonTrail Nov 14 '14 at 21:21
  • @OregonTrail: As I explained in the answer, `\x86` and `\x91` are non-printable control characters, not `†` and `‘`, and nothing in Latin-1 is `†` or `‘`. So, of course `"è®¿è§†é¢‘".encode('latin-1')` is going to give you an exception. But if you use `encode('cp1252')`, as explained in the answer, it works fine. – abarnert Nov 14 '14 at 22:14
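
The reverse direction discussed in these comments can be sketched roughly as follows, assuming Python 3 and assuming the garbled text was produced by a cp1252 decode. The names `mojibake`, `latin1_ok` and `restored` are made up for the illustration:

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3; assumes the garbled text came from a cp1252 decode.
mojibake = u'è®¿è§†é¢‘'

# Encoding to Latin-1 fails: the dagger (U+2020) and the left single quote
# (U+2018) are above U+00FF, so Latin-1 has no byte for them.
try:
    mojibake.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Encoding with the codec that did the wrong decode recovers the original
# UTF-8 bytes, which can then be decoded properly.
restored = mojibake.encode('cp1252').decode('utf-8')

print(latin1_ok, ascii(restored))
```

The round trip only works if you encode with the same codec that produced the garbled text. A string that was decoded as Latin-1 needs `encode('latin-1')`, and one that was decoded as cp1252 needs `encode('cp1252')`.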
2

You need to first encode the string to UTF-8 (which can encode any Unicode string), and then decode the resulting bytes as Latin-1:

>>> u'访视频'.encode('UTF-8').decode('latin-1')
u'\xe8\xae\xbf\xe8\xa7\x86\xe9\xa2\x91'

Note: UTF-8 can encode any Unicode character. It is also backward compatible with ASCII: a pure ASCII file is also a valid UTF-8 file, and a UTF-8 file that uses only ASCII characters is byte-for-byte identical to the ASCII file with the same characters.
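
As a quick check of the ASCII compatibility described in the note, here is a small sketch, assuming Python 3, with illustrative names and sample characters:

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3; the sample strings are illustrative.
ascii_text = u'hello'

# Pure ASCII text produces identical bytes under both codecs.
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Non-ASCII characters take more than one byte in UTF-8.
lengths = [len(ch.encode('utf-8')) for ch in (u'A', u'é', u'访')]

print(same_bytes, lengths)
```

The multi-byte sequences are why decoding UTF-8 bytes with a single-byte codec such as Latin-1 turns each Chinese character into three separate characters.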

Mazdak