17

I tried to understand encode and decode in Python on my own, but nothing is really clear to me.

  1. str.encode([encoding,[errors]])
  2. str.decode([encoding,[errors]])

First, I don't understand the need for the "encoding" parameter in these two functions.

What does each function return, and in what encoding? What is the "encoding" parameter for in each function? I also don't really understand the definition of "byte string".

My most important question: is there a way to convert from one encoding to another? I have read some text on ASN.1 about "octet string", so I wondered whether that is the same thing as "byte string".

Thanks for your help.

dda
  • 4
    But you did read the [docs](http://docs.python.org/library/stdtypes.html#str.encode), didn't you. Sorry I'm asking – tiwo Jul 21 '12 at 23:36

4 Answers

25

It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets". Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or one that can be written straight to disk). encode goes from abstract to concrete (you give it, preferably, a unicode string, and it gives you back a byte string); decode goes the opposite way.

The encoding is the rule that says, in UTF-8 for example, that 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xce 0xb1.
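To make that rule concrete, here is a minimal sketch. It assumes Python 3 (where `str` is the abstract Unicode string and `bytes` is the bytestring) and UTF-8 as the encoding; the variable names are only illustrative.

```python
# Python 3 sketch: an encoding maps abstract characters to concrete bytes.
a_bytes = "a".encode("utf-8")
alpha_bytes = "\u03b1".encode("utf-8")  # "\u03b1" is the Greek letter alpha

print(a_bytes.hex())      # 61
print(alpha_bytes.hex())  # ceb1

# decode applies the same rule in reverse: bytes back to characters.
alpha_text = alpha_bytes.decode("utf-8")
print(alpha_text == "\u03b1")  # True
```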

dda
lvc
  • If I understand correctly, a string doesn't really have a meaning outside the interpreter and can't be exchanged between machines. So when I write str.encode("ascii"), str becomes real: in that example it is coded in memory according to the ASCII specification, its encoded value is the one defined by ASCII, and this encoding is called a "bytes string". Is that right? – Narcisse Doudieu Siewe Jul 22 '12 at 00:29
  • Can one byte in a "bytes string" support addition? – Narcisse Doudieu Siewe Jul 22 '12 at 00:39
  • 1
    @NarcisseDoudieuSiewe for the first question, yes, that is right, although your terminology is a little mixed up - the "encoding" is ASCII, not the final bytestring - that might be called "an ascii-encoded string", for example. For the second question, one element of a bytestring is a one-element byte string (in Py2, a bytestring is just the `str` type, and a string is the `unicode` type), so `b[0] + b[0]` does concatenation. This is different in Py3, where one element of a bytestring is actually an `int` and so `b[0] + b[0]` does int addition. – lvc Jul 22 '12 at 01:01
  • :) I suppose "byte string" and "bytestring" are not the same, because on this web page: http://docs.python.org/library/codecs.html after "7.8.3 standard encoding" and between the first and the second encoding table, it is clearly stated: "For the codecs listed below, the result in the “encoding” direction is always a byte string." Thanks for your help with all of my questions – Narcisse Doudieu Siewe Jul 22 '12 at 01:44
  • I want to know something. At the moment I use Python ctypes to build some Wi-Fi frame structures, and with lorcon2 I can send them across the LAN. I want to transform such a structure into a string to get a hexadecimal representation of it. I have seen two functions for this purpose: ctypes.string_at and ctypes.wstring_at. I know that ctypes.wstring_at makes a Unicode string, but what is ctypes.string_at for? Which kind of string do we get from it: an ASCII string or a hexadecimal string? – Narcisse Doudieu Siewe Jul 22 '12 at 15:16
  • This is an awesome article for anyone that could use some help with python and unicode: http://nedbatchelder.com/text/unipain/unipain.html#1 (use the keyboard arrow keys to navigate through the slides) – Homer6 Jan 28 '14 at 21:36
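
The point raised in the comments about `b[0] + b[0]` can be checked with a short sketch. This assumes Python 3, where indexing a `bytes` object yields an `int`; in Python 2 the same expression would concatenate two one-character `str` values instead.

```python
# Python 3 sketch: indexing bytes gives an int, slicing gives bytes.
b = b"A"

first = b[0]                # the integer value of the byte
total = b[0] + b[0]         # integer addition
joined = b[0:1] + b[0:1]    # slices are bytes, so + concatenates

print(first)   # 65
print(total)   # 130
print(joined)  # b'AA'
```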
18

My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.

Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
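
As a rough illustration of why the "encoding" argument matters, the sketch below encodes the same code points with different encodings. It assumes Python 3; the sample text and variable names are made up for the example, and `errors="replace"` is just one of the standard error handlers.

```python
# Python 3 sketch: one Unicode string, several possible byte representations.
text = "\u00e9\u20ac"  # e-acute followed by the euro sign

code_points = [ord(ch) for ch in text]
print(code_points)  # [233, 8364]

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
print(utf8_bytes)   # b'\xc3\xa9\xe2\x82\xac'
print(utf16_bytes)  # b'\xe9\x00\xac '

# Latin-1 has no euro sign, so a strict encode fails; the optional
# "errors" argument chooses what to do instead of raising.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("latin-1", errors="replace")
print(replaced)  # b'\xe9?'

# Decoding with the encoding that produced the bytes restores the text.
round_trip = utf8_bytes.decode("utf-8")
```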

Matt
Ned Batchelder
7

Python 2.x has two types of strings:

  • str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
  • unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 code units, depending on how Python is built.

This model was changed for Python 3.x:

  • 2.x unicode became 3.x str (and the u prefix was dropped from the literals).
  • A bytes type was introduced for representing binary data.

A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:

>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'

To convert the other way, use the decode method:

>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
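
For comparison, the same round trip under the Python 3 model described above might look like the sketch below. It assumes Python 3, where `str` only offers `encode` and `bytes` only offers `decode`; the variable names are illustrative.

```python
# Python 3 sketch of the same round trip: str -> bytes -> str.
euro = "\u20ac"

encoded = euro.encode("utf-8")
print(type(encoded).__name__, encoded)  # bytes b'\xe2\x82\xac'

decoded = encoded.decode("utf-8")
print(type(decoded).__name__, decoded == euro)  # str True

# The two directions are split cleanly between the two types.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")
print(str_has_decode, bytes_has_encode)  # False False
```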
dan04
4

Yes, a byte string is an octet string. Encoding and decoding happen when inputting or outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and certain file formats need strange encodings, like BibTeX's accents: fran\c{c}aise. You need to convert from/to them on input/output.

The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).
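
Going from one encoding to another, as asked in the question, amounts to decoding with the source encoding and then encoding with the target one. The sketch below assumes Python 3; `transcode` is a hypothetical helper written for this example, not a standard library function.

```python
# Python 3 sketch: converting bytes from one encoding to another.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: decode with `source`, re-encode with `target`."""
    return data.decode(source).encode(target)

# Bytes as a latin-1 web server might send them.
latin1_bytes = "fran\u00e7aise".encode("latin-1")

# The same text as UTF-8, e.g. for a UTF-8 console or file.
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")

print(latin1_bytes)  # b'fran\xe7aise'
print(utf8_bytes)    # b'fran\xc3\xa7aise'
```

The text is the same in both cases; only its byte representation changes, which is why the two byte strings differ in length.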

tiwo