
I wrote the following two functions for storing and retrieving any Python object (built-in or user-defined) using a combination of json and jsonpickle (Python 2.7):

def save(kind, obj):
    pickled = jsonpickle.encode(obj)
    filename = DATA_DESTINATION[kind]  # returns file destination to store json
    with open(filename, 'w') as f:  # 'w' already truncates, no need to clear the file first
        json.dump(pickled, f)

def retrieve(kind):
    filename = DATA_DESTINATION[kind]  # returns file destination to store json
    if os.path.isfile(filename):
        with open(filename, 'r') as f:
            pickled = json.load(f)
            unpickled = jsonpickle.decode(pickled)
            print unpickled

I haven't tested these two functions with user-defined objects, but when I save() a built-in dictionary of strings (e.g. {'Adam': 'Age 19', 'Bill': 'Age 32'}) and then retrieve the same file, I get the same dictionary back in unicode: {u'Adam': u'Age 19', u'Bill': u'Age 32'}. I thought json/jsonpickle encoded to UTF-8 by default; what's the deal here?

[UPDATE]: Removing all jsonpickle encoding/decoding does not affect the output, which is still in unicode, so it seems like an issue with json itself? Perhaps I'm doing something wrong.
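(Editor's note: a quick sketch, not part of the original question, showing that the u'' prefix is purely cosmetic for ASCII text: the round-tripped dictionary compares equal to the original, since u'Adam' == 'Adam'.)

```python
import json

original = {'Adam': 'Age 19', 'Bill': 'Age 32'}
round_tripped = json.loads(json.dumps(original))

# The keys and values come back as unicode, but they still compare
# equal to the ASCII originals, so no data is lost in the round trip.
assert round_tripped == original
```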

zhuyxn

4 Answers

10
import jsonpickle
import json

jsonpickle.set_preferred_backend('json')
jsonpickle.set_encoder_options('json', ensure_ascii=False)
print(jsonpickle.encode({"value": "значение"}))

{"value": "значение"}
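(Editor's note: for comparison, the plain json module exposes the same ensure_ascii switch, so this behavior is not specific to jsonpickle; a sketch:)

```python
import json

# With ensure_ascii=True (the default), non-ASCII characters are
# written as \uXXXX escape sequences in an ASCII-only result.
escaped = json.dumps({"value": "значение"})

# With ensure_ascii=False, the characters pass through unescaped.
unescaped = json.dumps({"value": "значение"}, ensure_ascii=False)

print(escaped)    # escape sequences like \u0437...
print(unescaped)  # {"value": "значение"}
```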

Hunter Tran
typik89
1

You can encode the unicode string after calling loads().

json.loads('"\\u79c1"').encode('utf-8')

Now you have a normal string again.
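(Editor's note: a sketch of this round trip. On Python 2 the .encode('utf-8') call returns a str byte string; on Python 3 it returns a bytes object, but the bytes are the same either way:)

```python
import json

# json.loads always produces text (unicode on Python 2, str on Python 3).
text = json.loads('"\\u79c1"')

# Encoding gives the UTF-8 byte representation of the character 私 (U+79C1).
utf8_bytes = text.encode('utf-8')

assert utf8_bytes == b'\xe7\xa7\x81'            # UTF-8 bytes for U+79C1
assert utf8_bytes.decode('utf-8') == u'\u79c1'  # decodes back losslessly
```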

drjd
  • unicode is a different type of object [link](http://docs.python.org/howto/unicode.html#python-2-x-s-unicode-support) – drjd Aug 12 '12 at 22:09
  • So what. It's still a normal string. – Ignacio Vazquez-Abrams Aug 12 '12 at 22:28
  • :D yes it is a string. But its type is unicode, not a utf-8 encoded str. If you want utf-8 encoding, you have to encode it. By normal string I mean a str object. Python 2 distinguishes between unicode strings and 'normal' strings. type(u'abc') >>> unicode type('abc') >>> str – drjd Aug 12 '12 at 22:50
  • Both `unicode` and `str` are normal strings. The difference is that `unicode` is more accurate since bytestrings need to specify an encoding. – Ignacio Vazquez-Abrams Aug 12 '12 at 23:07
  • His input is a utf-8 str object. The output of load is a unicode object, but he wants a utf-8 str object again, so he has to encode the unicode object to a str object. What is the problem with this? --- I preferably would like to get back the same encoding i put in? – zhuyxn 1 hour ago – drjd Aug 12 '12 at 23:32
  • it seems like adding `.encode('utf-8')` does not change the output which is for some reason still in unicode. Using `json.load` rather than `json.loads` since I'm loading from a file (not sure if this is accurate) but `loads` causes an error. – zhuyxn Aug 12 '12 at 23:46
0

I thought json ... encoded by default to utf-8, what's the deal here?

No, it encodes to ASCII. And it decodes to unicode.

>>> json.dumps(u'私')
'"\\u79c1"'
>>> json.loads('"\\u79c1"')
u'\u79c1'
Ignacio Vazquez-Abrams
0

The problem is that json, as a serialization format, is not expressive enough to carry information about the original type of strings. In other words, if you have a JSON string "a" you can't tell whether it originated from a python string "a" or from a python unicode string u"a".

Indeed, the documentation of the json module describes the option ensure_ascii. Basically, depending on where you are going to write the generated json, you might tolerate a unicode string, or need an ascii string with all incoming unicode characters properly escaped.

For example:

>>> import json
>>> json.dumps({'a':'b'})
'{"a": "b"}'
>>> json.dumps({'a':u'b'}, ensure_ascii=False)
u'{"a": "b"}'
>>> json.dumps({'a':u'b'})
'{"a": "b"}'
>>> json.dumps({u'a':'b'})
'{"a": "b"}'
>>> json.dumps({'a':u'\xe0'})
'{"a": "\\u00e0"}'
>>> json.dumps({'a':u'\xe0'}, ensure_ascii=False)
u'{"a": "\xe0"}'

As you can see, depending on the value of ensure_ascii you end up with an ascii json string or a unicode one, but the components of the original objects are all flattened to the same common representation. Look at the {"a": "b"} case in particular.

jsonpickle simply uses json as its underlying serialization engine and adds no extra metadata to keep track of the original string types, so you are in fact losing information along the way.

>>> jsonpickle.encode({'a': 'b'})
'{"a": "b"}'
>>> jsonpickle.encode({'a': u'b'})
'{"a": "b"}'
>>> jsonpickle.encode({u'a': 'b'})
'{"a": "b"}'
Stefano Masini