Why do my Windows-1252 characters show up as escaped byte values when I print a dictionary, but render correctly when I print the value on its own?

Question

In my application, following Ned Batchelder's recommendation to make a Unicode sandwich, I first decode from Windows-1252 and re-encode as UTF-8:

row[field] = row[field].decode('cp1252').encode('utf-8')

Later on, when I want to send my data to an endpoint, I decode the UTF-8:

row[field] = fld.decode('utf-8')

When I print just the field that contains the offending Windows-1252 characters, they are rendered correctly:

print row['dash']
# as well – ... “the intent was”

But when I print the entire dictionary, I get escaped UTF-8 byte values instead:

print row
# as well \xe2\x80\x93 ... \xe2\x80\x9cthe intent was\xe2\x80\x9d

I want the Windows-1252 characters themselves, or equivalents such as the straight quotation mark instead of the left or right quotation mark.
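
A minimal sketch of the "equivalents" idea, assuming the field has already been decoded to Unicode text. The mapping table `ASCII_EQUIVALENTS` and the helper name `to_ascii_punctuation` are hypothetical, and the sample bytes are made up to match the output shown above. The code uses Python 3 syntax; in Python 2 the same `translate` call should work on `unicode` objects.

```python
# Hypothetical mapping from typographic punctuation (by code point)
# to plain ASCII equivalents. Extend it as needed.
ASCII_EQUIVALENTS = {
    0x2013: u"-",   # en dash
    0x2014: u"-",   # em dash
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def to_ascii_punctuation(text):
    """Replace curly quotes and dashes in already-decoded text."""
    return text.translate(ASCII_EQUIVALENTS)


# Sample Windows-1252 bytes (an assumption, not the real data).
raw = b"as well \x96 ... \x93the intent was\x94"
text = raw.decode("cp1252")          # bytes -> Unicode text
plain = to_ascii_punctuation(text)   # curly punctuation -> ASCII
print(plain)
```

Characters that are not in the table pass through unchanged, so this only flattens the punctuation listed and leaves any other non-ASCII text alone.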

When you are printing out a dictionary, the internal representation is shown, which is UTF-8. — Maurice Meyer, Nov 21 '18 at 18:13
@MauriceMeyer you're right. Can you add this as an answer so I can accept it? — Stepharr, Nov 21 '18 at 22:59
Your "sandwich" sounds backward. You `.decode()` to Unicode when reading in data to a program for processing, then `.encode()` to bytes to send it to a file or pipe. Databases "usually" can accept Unicode and are configured with an encoding that happens automatically when the database API puts the data in the database so you can skip the `.encode()` step if that's the case. — Mark Tolonen, Nov 22 '18 at 07:52
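
The two comments above can be illustrated with a short sketch. It is written for Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`. The sample data and the names `raw`, `row`, `payload`, and `wire_row` are assumptions made up for illustration, not the asker's actual code.

```python
# Input boundary: bytes arrive encoded as Windows-1252 (sample data).
raw = b"as well \x96 ... \x93the intent was\x94"

# Decode once, on the way in, and keep Unicode text inside the program.
text = raw.decode("cp1252")
row = {"dash": text}

# Output boundary: encode to UTF-8 only when the data leaves the program.
payload = row["dash"].encode("utf-8")

# Printing a container shows repr() of each value, so encoded bytes
# appear as \x escapes even though the data itself is intact.
wire_row = {"dash": payload}
print(repr(payload))
print(wire_row)

# Decoding the payload gives back the original Unicode text.
assert payload.decode("utf-8") == text
```

The escapes in the dictionary output are only how the container displays its values. Keeping the values as Unicode text until the final send avoids the repeated decode/encode round trip described in the question.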
