
This is funny... I am trying to read geographic lookup data from OpenStreetMap. The code that performs the query looks like this:

import json
import time
import urllib

params = urllib.urlencode({'q': ",".join(full_address), 'format': "json", "addressdetails": "1"})
query = "http://nominatim.openstreetmap.org/search?%s" % params
print query
time.sleep(5)
response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
print response
print response

The query for Zürich is correctly URL-encoded as UTF-8 data. Nothing surprising here:

http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json

When I print the response, the u with umlaut appears to be encoded as latin-1 (0xFC):

[{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443

but that's nonsense, because OpenStreetMap returns the JSON data in UTF-8:

Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 26 Jan 2011 13:48:33 GMT
  Server: Apache/2.2.14 (Ubuntu)
  Content-Location: search.php
  Vary: negotiate
  TCN: choice
  X-Powered-By: PHP/5.3.2-1ubuntu4.7
  Access-Control-Allow-Origin: *
  Content-Length: 3342
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: application/json; charset=UTF-8
Length: 3342 (3.3K) [application/json]

which is also confirmed by the file contents. On top of that, I explicitly specify UTF-8 both when decoding the bytes I read and when parsing the JSON.

What's going on here?

EDIT: apparently it's json.loads that screws up somehow.
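For what it's worth, json.loads already returns unicode objects when fed properly decoded UTF-8 text. A minimal Python 3 sketch, with a made-up payload shaped like the Zürich response above:

```python
import json

# Hypothetical UTF-8 payload mimicking the Nominatim response (not real data).
raw = b'[{"display_name": "Z\xc3\xbcrich, Schweiz"}]'

data = json.loads(raw.decode("utf-8"))
name = data[0]["display_name"]
assert name == "Z\u00fcrich, Schweiz"  # a string of code points, no encoding attached
print(name)  # prints: Zürich, Schweiz
```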

Stefano Borini

2 Answers


When I go and print the response, the u with umlaut is encoded latin1 (0xFC)

You are just misinterpreting the output. It's a unicode string (you can tell by the u prefix); there's no encoding "attached" to it. The \xfc means it's the code point with number 0xFC, which happens to be the u-umlaut (see http://www.fileformat.info/info/unicode/char/fc/index.htm). The reason this happens is that the numbering of the first 256 Unicode code points coincides with the latin-1 encoding.
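A quick check in a modern (Python 3) interpreter, where every str is a sequence of code points, illustrates the point:

```python
s = "\xfc"             # one code point, U+00FC (LATIN SMALL LETTER U WITH DIAERESIS)
assert s == "\u00fc"   # \xfc is just shorthand for \u00fc
assert s == "ü"
assert len(s) == 1     # one code point, regardless of any encoding

# Bytes only appear once you pick an encoding:
assert s.encode("utf-8") == b"\xc3\xbc"   # two bytes in UTF-8
assert s.encode("latin-1") == b"\xfc"     # one byte in latin-1
```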

In short, you did everything right: you have a unicode object with the right content (which is agnostic to encodings). You can choose the encoding you want when you use that content for output somewhere, by calling unicodestr.encode("utf-8") or by using the codecs module; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
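For instance, the encoding is chosen only at the output boundary. A Python 3 sketch (where the built-in open takes an encoding argument, covering most uses of the codecs module; the filename here is made up):

```python
import os
import tempfile

s = "Z\u00fcrich"
assert s.encode("utf-8") == b"Z\xc3\xbcrich"  # encoding happens on demand

# Write the text with an explicit encoding, then inspect the raw bytes on disk.
path = os.path.join(tempfile.gettempdir(), "zurich.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write(s)
with open(path, "rb") as f:
    assert f.read() == b"Z\xc3\xbcrich"
```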

etarion
  • @etarion but it says `UTF-8 (hex) 0xC3 0xBC` in the table. Shouldn't it be represented as such in UTF-8 content? If I'm not mistaken, if I take the `0xFC` literally and use it as a character in a UTF-8 string, it's going to be an invalid character. – Pekka Jan 26 '11 at 13:57
  • maybe you are right... How can I check if it's the codepoint or the actual data ? try to decode it to ascii ? – Stefano Borini Jan 26 '11 at 13:58
  • +1: I thought it might be something like that, but my python-fu isn't high enough to verify it properly. writing `print u'Z\xfcrich'` in a UTF8 console printed the right thing, after all. – araqnid Jan 26 '11 at 13:58
  • @araqnid Interesting. But I don't understand why it does that! In my understanding, it shouldn't. (I don't speak Python at all so I can't verify it either...) – Pekka Jan 26 '11 at 14:00
  • @Stefano, try `print response[0]['display_name']` to confirm that this is indeed correct. – Daniel Roseman Jan 26 '11 at 14:03
  • @Stefano Borini: If you see a 'u' as the prefix of the string in python output, it's a unicode string and escape sequences inside it mean codepoints. If there's no such prefix, it's a (byte) string and escape sequences mean bytes (actual data). – etarion Jan 26 '11 at 14:03
  • @Pekka: Python doesn't use UTF-8 for internal string representation. `00FC` is the Unicode code point for `ü`, as @etarion has explained. When you do encode it to UTF-8, it will be changed to `b"\xc3\xbc"`, but only then. `\xfc` is the same as `\u00fc`. – Tim Pietzcker Jan 26 '11 at 14:13
  • @Tim ah, that makes sense. But shouldn't it then be a double-byte `b\x00\xFC`? Isn't one-byte `\xfc` the latin-1 representation? Forgive me if it's a dumb question, as said I have no clue about Python, just curious – Pekka Jan 26 '11 at 14:15
  • @Pekka: I just edited my previous comment; `\xfc` is sort of a shorthand for `\u00fc`. Works only for single-byte codepoints (i. e. those that start with `00` as Unicode code points) because those are equivalent to the latin-1 codepoints. – Tim Pietzcker Jan 26 '11 at 14:16
  • It's not about python, it's about unicode. A unicode string is a sequence of codepoints, and 0xFC is the codepoint for the u-umlaut. It's not bytes – a unicode string has no notion of "byte". – etarion Jan 26 '11 at 14:17
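The byte-string vs. unicode-string distinction in the last comment can be checked directly (Python 3 syntax, where byte strings carry a b prefix):

```python
b = b"\xfc"   # one raw byte with value 0xFC -- not valid UTF-8 on its own
s = "\xfc"    # one code point, U+00FC

assert b.decode("latin-1") == s  # latin-1 maps byte 0xFC straight to U+00FC
try:
    b.decode("utf-8")
except UnicodeDecodeError:
    pass  # a lone 0xFC byte cannot start a valid UTF-8 sequence
else:
    raise AssertionError("expected a UnicodeDecodeError")
```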

The output is fine. When you print data on the console, Python encodes the Unicode data only when printing an actual string. If you print a list of unicode strings, each element is shown on the console as its repr():

>>> a=u'á'
>>> a
u'\xe1'
>>> print a
á
>>> [a]
[u'\xe1']
>>> print [a]
[u'\xe1']
vz0