What is the type of an "utf8" string encoding in Python?

Question

I'm using Python 2.7

I'm reading a file containing "iso-8859-1" encoded data. After parsing, I get the results as strings, e.g. s1:

>>> s1
'D\xf6rfli'
>>> type(s1)
<type 'str'>
>>> s2=s1.decode("iso-8859-1").encode("utf8")
>>> s2
'D\xc3\xb6rfli'
>>> type(s2)
<type 'str'>
>>> print s1, s2
D�rfli Dörfli
>>>

Why is the type of s2 still a str after the call to .encode? How can I convert it from str to utf-8?

I'm not familiar with Python, but what makes you think `utf-8` is a type? Also, the output is as expected; what more do you want? — Mr Lister, Jan 06 '13 at 12:52
This presentation may help you understand the fundamentals: [Pragmatic Unicode, or, How Do I Stop The Pain?](http://bit.ly/unipain). — Ned Batchelder, Jan 06 '13 at 13:07

score 2 · Answer 1 · edited May 23 '17 at 12:21

In Python 2, str is an encoded string, i.e. a sequence of bytes. This is documented behavior. Decoding a str gives you an object of type unicode.

UTF-8 is an encoding, just like ISO-8859-1. So you are decoding your string and then encoding it again in a different encoding, which produces data of the same type (str).
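A minimal sketch of that round trip, in case it helps to see the types at each step. The variable names are my own, and the `b'...'` literal is used so the snippet should behave the same on Python 2.7 (where it is a plain str) and on Python 3 (where it is bytes):

```python
# ISO-8859-1 bytes, as they would come out of the file.
# Python 2: type str. Python 3: type bytes.
raw = b'D\xf6rfli'

# Decoding turns bytes into text.
# Python 2: type unicode. Python 3: type str.
text = raw.decode('iso-8859-1')

# Encoding turns text back into bytes, now in UTF-8.
utf8 = text.encode('utf-8')

# Both byte strings have the same type; only the byte content differs.
same_type = type(raw) is type(utf8)
text_is_different_type = type(text) is not type(raw)
```

Note that the text has 6 characters while the UTF-8 byte string has 7 bytes, because "ö" takes two bytes in UTF-8.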

By contrast, in Python 3 str is a text string (Unicode), and calling encode on it gives you an instance of bytes.
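For comparison, a small sketch of the Python 3 side. This one assumes Python 3 only and would not behave the same on 2.7; the names are illustrative:

```python
# Python 3 only: a plain string literal is Unicode text.
text = 'D\xf6rfli'

# encode() goes from text (str) to bytes.
encoded = text.encode('utf-8')

# decode() goes from bytes back to text (str).
decoded = encoded.decode('utf-8')

# In Python 3 the two types no longer share both methods:
# str has no decode(), bytes has no encode().
str_has_decode = hasattr(text, 'decode')
bytes_has_encode = hasattr(encoded, 'encode')
```

So in Python 3 the type does change after `.encode`, which is probably the behavior the question was expecting.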

So, in Python 2, a UTF-8 string will be str, because it is encoded.

I second the recommendation by Ned: take a look at the presentation he links to (oh my, is it his own talk?). It helped me a lot when I was struggling with these things.

answered Jan 06 '13 at 12:51

Lev Levitsky

Good explanation, except you used `decode` instead of `encode` in the line about Python 3. – abarnert Jan 06 '13 at 13:03
So, if I understand correctly, `unicode` is NOT the same as a "utf-8" encoded `str`? – jdpiguet Jan 06 '13 at 13:07
@jdpiguet Correct. `unicode` is a Unicode string, not encoded at all. – Lev Levitsky Jan 06 '13 at 13:12
Thanks, that's the answer to the "Why" question! :) – jdpiguet Jan 06 '13 at 13:24

score 1 · Accepted Answer · answered Jan 06 '13 at 12:53

I'm not sure if this answers your questions, but here's what I observed.

If you just want to get the string into a printable form, stop after calling decode. I'm not sure why you are trying to encode into UTF-8 after successfully converting from ISO-8859-1 into unicode.

>>> s1 = 'D\xf6rfli'
>>> s1
'D\xf6rfli'
>>> s2 = s1.decode("iso-8859-1")
>>> s2
u'D\xf6rfli'
>>> print s2
Dörfli
>>>
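Since the data comes from a file, one option might be to let the file object do the decoding, so you only ever handle unicode. This is just a sketch: the temporary file stands in for your real file (I don't know its name or layout), and `io.open` is used because it accepts an `encoding` argument on both Python 2.7 and Python 3:

```python
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(b'D\xf6rfli\n')

# Open in text mode with the file's encoding; reads return decoded text
# (unicode in Python 2, str in Python 3), so no manual decode() is needed.
with io.open(path, encoding='iso-8859-1') as f:
    line = f.readline().rstrip(u'\n')

os.remove(path)
```

You would then only call `.encode("utf8")` at the point where you actually need bytes again, for example when writing to another file or a socket.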
