
I have the following use case:

From my data I produce JSON, part of which is Hebrew words. For example:

import json

j = {}
city = u'חיפה'  # native unicode
j['results'] = []
j['results'].append({'city': city})  # also tried city.encode('utf-8') and other encodings

In order to produce a JSON file that doubles as my app's DB (a micro geo-app) and as a file my users can edit to fix data directly, I use the json lib:

to_save = json.dumps(j)
with open('test.json', 'wb') as f:  # also tried 'w' instead of 'wb'
    f.write(to_save)  # the with block closes the file; no explicit f.close() needed

The problem is that I get an ASCII-escaped JSON file, with u'חיפה' represented, for example, as: "\u05d7\u05d9\u05e4\u05d4"

Most of the script and app have no problem reading the escaped Unicode string, but my USERS do! Since they contribute to the open-source project by editing the JSON directly, they can't figure out the Hebrew text.

So, THE QUESTION: how should I write the JSON so that opening it in another editor shows the Hebrew characters?

I'm not sure this is solvable, because I suspect JSON is Unicode all the way and I can't use plain ASCII in it, but I'm not sure about that.

Thanks for the help.

alonisser

1 Answer


Use the ensure_ascii=False argument:

>>> import json
>>> city = u'חיפה'
>>> print(json.dumps(city))
"\u05d7\u05d9\u05e4\u05d4"
>>> print(json.dumps(city, ensure_ascii=False))
"חיפה"

According to the json.dump documentation:

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only. If ensure_ascii is False, some chunks written to fp may be unicode instances. This usually happens because the input contains unicode strings or the encoding parameter is used. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error.

Your code should read as follows:

import json
j = {'results': [u'חיפה']}
to_save = json.dumps(j, ensure_ascii=False)
with open('test.json', 'wb') as f:
    f.write(to_save.encode('utf-8'))

or

import codecs
import json
j = {'results': [u'חיפה']}
to_save = json.dumps(j, ensure_ascii=False)
with codecs.open('test.json', 'wb', encoding='utf-8') as f:
    f.write(to_save)

or

import codecs
import json
j = {'results': [u'חיפה']}
with codecs.open('test.json', 'wb', encoding='utf-8') as f:
    json.dump(j, f, ensure_ascii=False)
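
To double-check the result, a minimal round-trip sketch: write the file as in the last version, then read it back with codecs.open so json.load sees decoded text (the bytes on disk are UTF-8, so the Hebrew is editable in any UTF-8-aware editor):

```python
import codecs
import json

j = {'results': [u'חיפה']}

# Write: ensure_ascii=False keeps the Hebrew as literal characters,
# and the codecs writer encodes them to UTF-8 bytes on disk.
with codecs.open('test.json', 'wb', encoding='utf-8') as f:
    json.dump(j, f, ensure_ascii=False)

# Read back: the codecs reader decodes UTF-8 into Unicode strings.
with codecs.open('test.json', 'rb', encoding='utf-8') as f:
    loaded = json.load(f)

assert loaded == j
```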
falsetru
  • Just a note: According to [JSON docs](http://www.json.org/) a value can contain any Unicode characters (except `"` , `\ ` and control chars) _or_ escaped ones (like `\u05d7`). So both versions of `json.dump` produce valid JSON. – Andras Nemeth Aug 28 '13 at 06:59
  • @falsetru what was biting me was the `fp.write()`, which sent me on a whole goose chase after the correct `json.dumps` config (since I had already tried the one you suggested) - the `codecs` write tip was the missing part for me. Notice you should use `codecs.open('file.json', 'wb', encoding='utf-8')` for this to work. – alonisser Aug 28 '13 at 07:03
  • @falsetru notice you don't need the ```to_save``` line now, your version is more elegant – alonisser Aug 28 '13 at 10:16
  • @alonisser, You are right. The last version does not need `to_save`. I edited that. – falsetru Aug 28 '13 at 10:18