UnicodeDecodeError using Django and format-strings

Question

I wrote a small example, using Python 2.7 and Django 1.10.8, so everybody can see what's going on:

# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function

import time
from django import setup
setup()
from django.contrib.auth.models import Group

group = Group(name='schön')

print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

time.sleep(1.0)
print('%s' % group)
print('%r' % group)   # fails
print('%s' % [group]) # fails
print('%r' % [group]) # fails

It exits with the following output and traceback:

$ python .PyCharmCE2017.2/config/scratches/scratch.py
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
Traceback (most recent call last):
  File "/home/srkunze/.PyCharmCE2017.2/config/scratches/scratch.py", line 22, in <module>
    print('%r' % group) # fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Does anybody have an idea what's going on here?

are you able to complement your question with a test where you are not nesting your "group" object inside a vector? — David Bern, Oct 13 '17 at 09:51
I have a feeling that your representation method inside group class def is doing something naughty :P — David Bern, Oct 13 '17 at 11:41
I have a hard time to replicate your error. But, I have encountered the same problem. The cause was that a "ä" was decoded in the __repr__ method. Initially that worked well, until the day I imported unicode_literals from __future__. The solutions then was to simply remove the use of decodes and __repr__ to return a unicode. — David Bern, Oct 13 '17 at 14:06
The problem is Group.__repr__ is not my code it's from Django. — Sven R. Kunze, Oct 14 '17 at 13:26
@DavidBern did you also use Django or did you roll your own class implementation? — Sven R. Kunze, Oct 16 '17 at 13:40
My similar problem was a non django project. But the error is very familiar. I might have a solution to you in a couple of hours. Have to leave work first. — David Bern, Oct 16 '17 at 15:40
You are interpolating into unicode strings, which include an implicit decode. Use `b'...'` bytestrings instead. — Martijn Pieters, Oct 22 '17 at 14:04
@DavidBern: It is trivially reproducible: `u'%s' % ''`. At issue here is the `from __future__ import unicode_literals` used by the OP. — Martijn Pieters, Oct 22 '17 at 14:15

Martijn Pieters · Accepted Answer · 2017-10-23T09:33:06.527

At issue here is that you are interpolating UTF-8 bytestrings into a Unicode string. Your '%r' string is a Unicode string because you used from __future__ import unicode_literals, but repr(group) (used by the %r placeholder) returns a bytestring. For Django models, repr() can include Unicode data in the representation, encoded to a bytestring using UTF-8. Such representations are not ASCII safe.

For your specific example, repr() on your Group instance produces the bytestring '<Group: sch\xc3\xb6n>'. Interpolating that into a Unicode string triggers the implicit decoding:

>>> u'%s' % '<Group: sch\xc3\xb6n>'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Note that I did not use from __future__ import unicode_literals in my Python session, so the '<Group: sch\xc3\xb6n>' string is not a unicode object, it is a str bytestring object!
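To make the implicit step visible, here is a minimal sketch (not part of the session above) that performs the same decode explicitly. It assumes the default encoding is ASCII, which is what Python 2 uses unless it has been changed, and it is written to run unchanged on Python 2.7 and Python 3. The names raw and error are illustrative only.

```python
# The UTF-8 bytes that repr(group) returned in the question.
raw = b'<Group: sch\xc3\xb6n>'

try:
    # Roughly what Python 2 does behind the scenes for u'%s' % raw,
    # assuming the default encoding is still ASCII.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Index of the first byte that is not valid ASCII (the 0xc3 byte).
print(error.start)
```

The reported index should match the "position 11" in the traceback, since 0xc3 is the first byte of the UTF-8 encoded ö.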

In Python 2, you should avoid mixing Unicode and byte strings. Always explicitly normalise your data (encoding Unicode to bytes or decoding bytes to Unicode).
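A minimal sketch of that normalisation, assuming the bytes are UTF-8 encoded (as the Django representation is here). The variable names are illustrative, and the u prefix is spelled out so the snippet does not depend on unicode_literals.

```python
# Bytestring as returned by repr(group) in the question (UTF-8 encoded).
raw = b'<Group: sch\xc3\xb6n>'

# Decode explicitly, using the encoding the bytes were written in ...
text = raw.decode('utf-8')

# ... so that both sides of the interpolation are Unicode.
message = u'%s' % text

# Compare against an escaped literal to avoid depending on the
# terminal's encoding when printing.
print(message == u'<Group: sch\xf6n>')
```

Because the decode names its codec, no implicit ASCII decode takes place during the interpolation.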

If you must use from __future__ import unicode_literals, you can still create bytestrings by using a b prefix:

>>> from __future__ import unicode_literals
>>> type('')   # empty unicode string
<type 'unicode'>
>>> type(b'')  # empty bytestring, note the b prefix
<type 'str'>
>>> b'%s' % b'<Group: sch\xc3\xb6n>'  # two bytestrings
'<Group: sch\xc3\xb6n>'

Until we can upgrade to Python 3, we need to be more careful about this kind of string operations. For the time being, we will probably install the monkey patch to django models: __repr__.encode('ascii', 'replace'). Maybe, you can add this to your reply for future readers. — Sven R. Kunze, Nov 06 '17 at 12:03

score 3 · Answer 2 · edited Oct 17 '17 at 06:48

I had a hard time finding a general solution to your problem. As I understand it, __repr__() is supposed to return str, and any effort to change that seems to cause new problems.

Even though the __repr__() method is defined outside your project, you can still override it. For example:

def new_repr(self):
    return 'My representation of self {}'.format(self.name)

Group.add_to_class("__repr__", new_repr)

The only solution I can find that works is to explicitly tell the interpreter how to handle the strings.

from __future__ import unicode_literals
from django.contrib.auth.models import Group

group = Group(name='schön')

print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

print('%s' % group)
print('%r' % repr(group))
print('%s' % [str(group)])
print('%r' % [repr(group)])

# added
print('{}'.format([repr(group).decode("utf-8")]))
print('{}'.format([repr(group)]))
print('{}'.format(group))

Working with strings in Python 2.x is a mess. I hope this sheds some light on how to work around the problem, which is the only approach I could find.

guettli · Answer 3 · 2017-10-27T09:17:51.600

1

I think the real issue is in the django code.

It was reported six years ago:

https://code.djangoproject.com/ticket/18063

I think a patch to Django along these lines would solve it:

def __repr__(self):
    return self.....encode('ascii', 'replace')

I think the repr() method should return "7 bit ascii".
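To illustrate what "7 bit ascii" would look like, here is a rough sketch with a hypothetical helper, ascii_safe, which is not part of Django. It only shows the effect of encode('ascii', 'replace') on the text from the question. How it would be wired into a model's __repr__ is left open.

```python
def ascii_safe(text):
    """Encode Unicode text to ASCII bytes; other characters become '?'."""
    return text.encode('ascii', 'replace')


# Hypothetical input: the Unicode form of the representation in the question.
safe = ascii_safe(u'<Group: sch\xf6n>')

# The result contains only ASCII bytes, so decoding it as ASCII cannot fail.
print(safe.decode('ascii'))
```

The trade-off is that the ö is lost: the output reads <Group: sch?n>.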

edited Oct 27 '17 at 09:17

answered Oct 27 '17 at 08:05

guettli


@TechJS a placeholder for something which is not important in this context. – guettli Oct 27 '17 at 09:16
Not sure that Django is in the wrong there. Like `__str__` in Python 2, there is *no requirement to return ASCII-safe data*. Sure, core Python types do this, but it is not a stated requirement. – Martijn Pieters Oct 27 '17 at 09:57

@GhostlyMartijn yes, you are right. I checked the docs. There is no official requirement, but I would call it "best practice". Or "avoid confusion". Docs: https://docs.python.org/2/reference/datamodel.html#object.__repr__ – guettli Oct 27 '17 at 10:33

Well, the *cause* here is still that the OP is mixing bytestrings and unicode strings. If the string was ASCII safe things happen to work, but you really shouldn't mix them anyway. Python 3 would prevent this scenario altogether. – Martijn Pieters Oct 27 '17 at 10:35

anjaneyulubatta505 · Answer 4 · 2017-10-24T17:49:45.850

-1

If that's the case, then we need to override the __unicode__ method with a customised one. Try the code below; it worked when I tested it.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from django.contrib.auth.models import Group

def custom_unicode(self):
    return u"%s" % (self.name.encode('utf-8', 'ignore'))
Group.__unicode__ = custom_unicode

group = Group(name='schön')

# Tests
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

print('%s' % group)
print('%r' % group)  
print('%s' % [group])
print('%r' % [group])

# output:
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
<Group: schön>
[<Group: schön>]
[<Group: schön>]

Reference: https://docs.python.org/2/howto/unicode.html

edited Oct 24 '17 at 17:49

answered Oct 23 '17 at 17:48

anjaneyulubatta505


The model "Group" is not my code. It is from django. I can't modify it. – guettli Oct 24 '17 at 08:36
@guettli I have updated code and tested it in ubuntu. Please check it once. – anjaneyulubatta505 Oct 24 '17 at 17:50
This doesn't solve the problem. The model will still have non-ASCII bytes in the `__repr__`. But you now *also* have UTF-8 bytes in the `__unicode__` result **where they don't belong**. And you get the exact same result anyway because you messed with the default encoding, so the whole exercise is pointless. – Martijn Pieters Oct 25 '17 at 21:46
The only thing you did that makes it all work is the `setdefaultencoding()` call, which is like tying a stick to your broken leg. **It is the wrong solution**. You should set the broken bone instead, that is, to not mix bytestrings and Unicode text in the first place. – Martijn Pieters Oct 25 '17 at 21:49

score -1 · Answer 5 · answered Oct 26 '17 at 07:24


I am not familiar with Django. Your issue seems to be that text which is actually Unicode is being represented as ASCII. Please try the unidecode module for Python.

from unidecode import unidecode
#print(string) is replaced with 
print(unidecode(string))

See the Unidecode package.

answered Oct 26 '17 at 07:24

Sreeragh A R



This is not needed. The module is great when encoding to a target that's limited to ASCII only, but that's not the case here. They *already have encoded bytes*; at issue here is the implicit *decoding back to unicode*. This module won't help there. – Martijn Pieters Oct 26 '17 at 21:25
