
I'm using Python 2.7

I'm reading a file containing ISO-8859-1 encoded data. After parsing, I get the results as strings, e.g. s1:

>>> s1
'D\xf6rfli'
>>> type(s1)
<type 'str'>
>>> s2=s1.decode("iso-8859-1").encode("utf8")
>>> s2
'D\xc3\xb6rfli'
>>> type(s2)
<type 'str'>
>>> print s1, s2
D�rfli Dörfli
>>> 

Why is the type of s2 still a str after the call to .encode? How can I convert it from str to utf-8?

Lev Levitsky
jdpiguet
  • I'm not familiar with Python, but what makes you think `utf-8` is a type? Also, the output is as expected; what more do you want? – Mr Lister Jan 06 '13 at 12:52
  • This presentation may help you understand the fundamentals: [Pragmatic Unicode, or, How Do I Stop The Pain?](http://bit.ly/unipain). – Ned Batchelder Jan 06 '13 at 13:07

2 Answers


In Python 2, str means an encoded string, i.e. a sequence of bytes. This is documented behavior. Decoding a str gives you an object of type unicode.

UTF-8 is an encoding, just like ISO-8859-1. So you decode your string and then encode it in another encoding, which produces data of the same type (str).
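To make the round trip concrete, here is a minimal sketch of the decode/encode steps described above. The variable names are illustrative and not taken from the question; the `b'...'` literal is used so the snippet should behave the same on Python 2.7 (where `bytes` is an alias for `str`) and on Python 3.

```python
# Sketch only: variable names are illustrative, not from the question.
raw_latin1 = b'D\xf6rfli'                # bytes as read from the ISO-8859-1 file

text = raw_latin1.decode('iso-8859-1')   # bytes -> text (type unicode on Python 2)
raw_utf8 = text.encode('utf-8')          # text -> bytes again, this time UTF-8

# Both ends of the round trip are byte strings (plain str on Python 2).
same_type = type(raw_latin1) is type(raw_utf8)

# Only the byte sequence differs: the o-umlaut takes one byte in
# ISO-8859-1 and two bytes in UTF-8.
lengths = (len(raw_latin1), len(text), len(raw_utf8))
```

Here `same_type` is true, which is why `type(s2)` in the question still reports `str`: the encoding changed, the type did not.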

By contrast, in Python 3 str is a text string (in Unicode), and calling encode on it gives you an instance of bytes.
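A small sketch of the Python 3 behavior, assuming a Python 3 interpreter (it will report different type names on Python 2). The names are illustrative.

```python
# Python 3 sketch; names are illustrative.
text = 'D\xf6rfli'                  # str: a sequence of Unicode code points
encoded = text.encode('utf-8')      # bytes: the UTF-8 representation

# The two types are now distinct, so encode visibly changes the type.
kinds = (type(text).__name__, type(encoded).__name__)

# Decoding the bytes recovers the original text.
round_trip = encoded.decode('utf-8') == text

# Python 3 str has no decode method; only bytes can be decoded.
str_has_decode = hasattr(str, 'decode')
```

On Python 3, `kinds` should come out as `('str', 'bytes')`, which is the type change the question was expecting to see.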

So, in Python 2, a UTF-8 string will be str, because it is encoded.

I second the recommendation by Ned: take a look at the presentation he links to (oh my, is it his own talk?). It helped me a lot when I was struggling with these things.

Community
Lev Levitsky

I'm not sure if this answers your questions, but here's what I observed.

If you just want to get the string into a printable form, stop after calling decode. I'm not sure why you are trying to encode into UTF-8 after successfully converting from ISO-8859-1 into unicode.

>>> s1 = 'D\xf6rfli'
>>> s1
'D\xf6rfli'
>>> s2 = s1.decode("iso-8859-1")
>>> s2
u'D\xf6rfli'
>>> print s2
Dörfli
>>> 
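Since the question starts from a file, one possible way to get unicode objects directly is to let the file object do the decoding. This is only a sketch: `io.open` is in the standard library from Python 2.6 onward, but the temporary file below is a hypothetical stand-in for the real input.

```python
import io
import os
import tempfile

# Hypothetical sample file standing in for the ISO-8859-1 input.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'D\xf6rfli\n')

# io.open decodes while reading, so each line is already text
# (unicode on Python 2, str on Python 3) and needs no explicit decode.
with io.open(path, encoding='iso-8859-1') as f:
    line = f.readline().rstrip(u'\n')

os.remove(path)
```

With this approach the parsing code works on text throughout, and encoding to UTF-8 is only needed when writing bytes back out.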
selbie