
I'm working on a Django app that gets data from a Redshift database. The people who wrote to this Redshift database made some bizarre choices, and I'm trying to figure out how to handle the fact that it contains strings like this:

Let’s play! \xe2\x9a\xbd\xef\xb8\x8f

With psycopg2 I'm getting these into Python as

u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'

I'm supposed to render this as

Let’s play! ⚽️

They're stored in the Redshift table as varchar(65535). I'm using psycopg2 to pull them into the app backend (Django 1.11, Python 2.7), and they are rendered with JavaScript on the front end.

Any suggestions would be appreciated -- I can't figure out a solution at all.

Simplified, this is roughly how the values are being pulled (the connection details and table/column names below are placeholders, not the real ones):

import psycopg2

# Placeholder connection details -- the real ones come from settings.
conn = psycopg2.connect(
    host='my-cluster.example.redshift.amazonaws.com',
    port=5439,
    dbname='analytics',
    user='app_user',
    password='...',
)
cur = conn.cursor()
cur.execute("SELECT message_text FROM messages LIMIT 1")
value = cur.fetchone()[0]
print repr(value)  # shows the escaped repr above
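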

Nathan Hinchey
    You've got [mojibake](https://en.wikipedia.org/wiki/Mojibake) in the database. It has a mix of encodings. I'm unsure how you even got the result you did in psycopg2, because I couldn't find an encoder that could return that result. Getting a U+2019 codepoint implies a Windows-125x encoder, but none of them return the UTF-8 remainder in Python without an error. A correct UTF-8 encoding of the string would be `'Let\xe2\x80\x99s play! \xe2\x9a\xbd\xef\xb8\x8f'` – Mark Tolonen Oct 11 '17 at 15:48
  • Whoops, you're right about what I'm getting from psycopg2. Updated question. – Nathan Hinchey Oct 11 '17 at 16:14

1 Answer


Because there is a mix of encodings (what looks like Windows-1252 plus literal hex-escaped UTF-8), you could post-process the entries, provided the mix is consistent.

Below is a regular expression to find sequences of hex escapes and replace them with a UTF-8 decoded equivalent:

import re

# Match runs of literal backslash-x hex escapes, e.g. \xe2\x9a\xbd
esc = re.compile(r'(?:\\x[0-9a-f]{2})+')

def fixer(m):
    # latin1 maps U+0000-U+00FF one-to-one onto bytes 00-FF, so the round trip
    # lets unicode-escape turn the literal \xNN text into the real bytes,
    # which are then decoded as UTF-8.
    return m.group().encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')

s = u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f'
f = esc.sub(fixer, s)
print repr(f)
print f

Output:

u'Let\u2019s play! \u26bd\ufe0f'
Let’s play! ⚽️
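
If the mix is consistent across the column, you could run every value through the same substitution wherever you read it -- for example with a small helper like this (the name and the None handling are just illustrative):

def fix_mojibake(value):
    # Apply the substitution above to a single value; leave None alone.
    if value is None:
        return value
    return esc.sub(fixer, value)

print fix_mojibake(u'Let\u2019s play! \\xe2\\x9a\\xbd\\xef\\xb8\\x8f')
# Let’s play! ⚽️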
Mark Tolonen
  • This works, thanks! Accepting this answer. Can you explain why you're encoding, decoding, encoding again, and decoding again? I've seen that as part of a lot of unicode solutions and I'm not really clear on why. – Nathan Hinchey Oct 11 '17 at 17:37
  • 1
    @NathanHinchey You *encode* to bytes, *decode* to Unicode. `.encode('latin1')` is a trick to translate U+0000 to U+00FF directly to bytes 00-FF since the `latin1` character set is the first 256 codepoints of Unicode. So starting from a Unicode string, turn it to bytes to be able to use the `unicode-escape` codec, but the result is another Unicode string, so encode again to bytes and decode to UTF8. – Mark Tolonen Oct 11 '17 at 17:42
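
Spelled out step by step on the escape run from the question, the chain in the comment above does the following (Python 2.7):

run = u'\\xe2\\x9a\\xbd\\xef\\xb8\\x8f'        # the matched text: literal backslash escapes in a unicode string
as_bytes = run.encode('latin1')                # same characters, now as a byte string
unescaped = as_bytes.decode('unicode-escape')  # u'\xe2\x9a\xbd\xef\xb8\x8f' -- six Latin-1 codepoints
raw_utf8 = unescaped.encode('latin1')          # '\xe2\x9a\xbd\xef\xb8\x8f' -- now the real UTF-8 bytes
fixed = raw_utf8.decode('utf8')                # u'\u26bd\ufe0f'
print fixed                                    # ⚽️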