Decoding a String that Contains Encoded Characters

Question

I have some strings that I am pasting into my script as test data. The strings come from emails that contain encoded characters, and running the script throws a SyntaxError. So far, I have not been able to find a solution to this issue. When I print repr(string), I get these strings:

'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
'Total Value for 2nd Load \xe2\x80\x93 approx. $74,300\n'

And this error pops up when I run my script:

SyntaxError: Non-ASCII character '\xe2' in file <filename> on line <line number>, but no 
encoding declared; see http://www.python.org/peps/pep-0263.html

When I just print the lines containing the encoded characters I get this:

'Total Value for 2nd Load â€“ approx. $74,300'

The data looks like this when I copy it from the email:

'Total Value for 1st Load – approx. $75,200'
'Total Value for 2nd Load – approx. $74,300'

From my searches, I believe it's encoded with UTF-8, but I have no idea how to work with this data, given that some characters are encoded but most of them are not (maybe?). I have tried various "solutions" I have found so far, including adding # -*- coding: utf-8 -*- to the top of my script, and the script just hangs... It doesn't do anything :(

If someone could provide some information on how to deal with this scenario, that would be amazing.

I have tried decoding and encoding using string.encode() and string.decode(), using different encoding based on what I could find on Google, but that hasn't solved the problem.

I would really prefer a Python solution because the project I'm working on requires people to paste data into a text field in a GUI, and then that data will be processed. Other solutions suggested pasting the data into something like Word or Notepad, saving it as plain text, then doing another copy/paste back from that file. This is a bit much. Does anybody know of a Pythonic way of dealing with this issue?

*All* your characters are encoded. It just so happens that the first 128 characters encoded by UTF-8 are the *exact same characters* encoded by ASCII. So `T` is `\x54` in both ASCII and UTF-8, and Python always shows ASCII characters instead of the byte value. — Martijn Pieters, Oct 08 '14 at 21:22
What you see when printing is called a [Mojibake](http://en.wikipedia.org/wiki/Mojibake); UTF-8 bytes interpreted wrongly because your console is probably set to Windows Codepage 1252. — Martijn Pieters, Oct 08 '14 at 21:34
Adding `# -*- coding: utf-8 -*-` as the first or second line of your source should have worked, and definitely shouldn't have made anything hang. There's something you're not telling us. — Mark Ransom, Oct 08 '14 at 21:45
@MarkRansom, Well, I don't know what that might be... I got a pop-up in idle telling me to change the encoding and I hit Ok, and now it hangs no matter what. — RattleyCooper, Oct 08 '14 at 22:05
Ok, after adding a bunch of print statements I have concluded that you are correct. I was so convinced it was an encoding issue that was causing it to hang that I didn't look past my own assumptions. Probably should have taken a break. Thanks! — RattleyCooper, Oct 08 '14 at 22:15
Always happy when I can be helpful. I take it the problem described in the question is solved then? — Mark Ransom, Oct 08 '14 at 22:17
I actually have to test here in a couple hours. My job requires me to do other things right now. Kind of frustrating when these types of things come up, but I gotta do what I gotta do. I believe it is solved and will mark the answer down as soon as I confirm it. I am almost 100% sure this is the correct fix though. — RattleyCooper, Oct 08 '14 at 22:25
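The two points from the comments above (ASCII characters keep their byte values in UTF-8, and the garbled output is UTF-8 read as code page 1252) can be checked directly. This is only a minimal sketch: it assumes the bytes from the question and assumes the console really was decoding as cp1252, which the comments only suggest as probable. The variable names are illustrative. It should run on both Python 2 and Python 3.

```python
# Bytes from the question: the en dash is the three-byte sequence \xe2\x80\x93.
raw = b'Load \xe2\x80\x93 approx.'

# ASCII characters have the same byte value in UTF-8: 'T' is 0x54 in both.
t_ascii = u'T'.encode('ascii')
t_utf8 = u'T'.encode('utf-8')

# Decoding with the right codec yields the en dash (U+2013).
correct = raw.decode('utf-8')

# Decoding the same bytes as cp1252 yields three unrelated characters,
# which is the mojibake seen when the line was printed.
mojibake = raw.decode('cp1252')

print(repr(correct))
print(repr(mojibake))
```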

score 1 · Accepted Answer · answered Oct 08 '14 at 21:23

>>> msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
>>> print msg.decode("utf-8")
Total Value for 1st Load – approx. $75,200

Make sure you use something like IDLE that can display these characters (i.e. a DOS terminal probably will not!).
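For data pasted into a GUI text field, one possible approach is to decode once, at the point where the data enters the program, and work with text from then on. The sketch below is an assumption rather than part of the answer: `to_text` is a hypothetical helper, and whether a given GUI toolkit hands back raw bytes or already-decoded text depends on the toolkit. It is written to run on both Python 2 and Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return text, decoding only if given raw bytes."""
    if isinstance(value, bytes):  # on Python 2, bytes is the plain str type
        return value.decode(encoding)
    return value  # already text, leave it alone


# Stand-in for a value read from a text field (assumed to be UTF-8 bytes).
pasted = b'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
line = to_text(pasted)
print(repr(line))
```

Passing already-decoded text through unchanged avoids the double decoding that tends to happen when `.decode()` is called at several places in the same program.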

answered Oct 08 '14 at 21:23

Joran Beasley

Ok, I had some other mistakes in my code that made me think that the encoding wasn't working, but it was my mistake. This is the correct answer. – RattleyCooper Oct 09 '14 at 00:03