Encode Decode using python

Question

I have this function in Python:

Str = "Ã¼"
print Str


def correctText(str):
    str = str.upper()
    correctedText = str.decode('UTF8').encode('Windows-1252')
    return correctedText

corText = correctText(Str)
print corText

It works and converts characters like Ã¼ and Ã©; however, it fails when I try Ã? and Â¶.

Is there a way I can fix it?

Is that Python2 or Python3? If (as using `print` statement suggests) it's 2, how is your source file encoding declared? — Błotosmętek, Jul 05 '17 at 14:56

score 0 · Answer 1 · answered Jul 05 '17 at 14:59


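As far as UTF-8 is concerned, the bytes you get from Ã? (C3 3F) are not a valid sequence: a lead byte such as C3 must be followed by a continuation byte, and ? is not one. (Â¶ gives C2 B6, which is valid UTF-8 but decodes to ¶, probably not the character you expect.) What you need to do is either use some other kind of encoding or handle the errors in your str by passing an errors argument such as 'ignore' or 'replace' to the unicode() function. I recommend the latter.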
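
As a rough illustration of the error-handling option, here is a sketch in Python 3 syntax (an assumption; the question appears to use Python 2, where unicode() accepts the same encoding and errors arguments). The byte string `raw` is a made-up example, not data from the question.

```python
# Bytes that are not valid UTF-8: the lead byte C3 is followed by '?' (3F).
raw = b"\xc3\x3f"

# str(bytes, encoding, errors) is the Python 3 counterpart of Python 2's unicode().
# errors="ignore" drops undecodable bytes; errors="replace" substitutes U+FFFD.
ignored = str(raw, "utf-8", "ignore")
replaced = str(raw, "utf-8", "replace")

print(ascii(ignored))
print(ascii(replaced))
```

Either handler hides the damaged byte rather than recovering the original character, so it only makes sense when losing that character is acceptable.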

PhoenixFireFlite

Thanks for your reply. Do you know what kind of encoding those characters belong to? – Maria C Jul 05 '16:07 placeholder

score 0 · Answer 2 · answered Jul 07 '17 at 17:41

What you are trying to do is to compose valid UTF-8 byte sequences from the codes of several consecutive Windows-1252 characters.

For example, for Ã¼, the Windows-1252 code of Ã is C3 and for ¼ it's BC. Together the code C3BC happens to be the UTF-8 code of ü.
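
The byte arithmetic above can be checked with a short sketch. It uses Python 3 syntax (an assumption, since the question looks like Python 2), where the repair is written as encode to Windows-1252, then decode as UTF-8. The variable names are illustrative only.

```python
mojibake = "\u00c3\u00bc"        # the two characters Ã and ¼
raw = mojibake.encode("cp1252")  # their Windows-1252 codes: C3 BC
repaired = raw.decode("utf-8")   # the same bytes read as UTF-8

print(raw.hex(), ascii(repaired))
```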

Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code, because the second byte (3F, binary 00111111) does not start with the bits 10 that every UTF-8 continuation byte must have.
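
A minimal sketch of that continuation-byte check, in Python 3 syntax (an assumption). The helper `is_continuation_byte` is a hypothetical name introduced here for illustration.

```python
def is_continuation_byte(byte: int) -> bool:
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

good = "\u00c3\u00bc".encode("cp1252")  # Ã¼ -> bytes C3 BC
bad = "\u00c3?".encode("cp1252")        # Ã? -> bytes C3 3F

good_ok = is_continuation_byte(good[1])  # BC = 10111100
bad_ok = is_continuation_byte(bad[1])    # 3F = 00111111

# Decoding the bad pair strictly raises instead of producing a character.
try:
    bad.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(good_ok, bad_ok, decode_failed)
```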

Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of the UTF-8 code (C3A0) is Ã followed by a non-printable character (non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
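
To see how à could turn into Ã plus an invisible character, here is a sketch in Python 3 syntax (an assumption) that reproduces the misreading in the forward direction. Whether this is what happened to the asker's text is not known.

```python
original = "\u00e0"                 # à
as_utf8 = original.encode("utf-8")  # bytes C3 A0
misread = as_utf8.decode("cp1252")  # Ã followed by U+00A0 (non-breaking space)

# Show the code points, since the second character is invisible when printed.
print(as_utf8.hex(), [hex(ord(ch)) for ch in misread])

# If the non-breaking space survives, the round trip recovers the original.
recovered = misread.encode("cp1252").decode("utf-8")
```

If the non-breaking space was lost or replaced by ? somewhere along the way, the second byte is gone and this round trip can no longer recover à.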

For Â¶ the Windows-1252 encoding is C2B6. Shouldn't it be Ã¶, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
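
The two candidates can be compared side by side with a sketch in Python 3 syntax (an assumption). The helper `reinterpret` is a hypothetical name, not something from the question.

```python
def reinterpret(text: str) -> str:
    """Encode text as Windows-1252, then decode the resulting bytes as UTF-8."""
    return text.encode("cp1252").decode("utf-8")

from_question = reinterpret("\u00c2\u00b6")  # Â¶ -> bytes C2 B6
suggested = reinterpret("\u00c3\u00b6")      # Ã¶ -> bytes C3 B6

print(ascii(from_question), ascii(suggested))
```

Both inputs decode without error; Â¶ yields the pilcrow ¶ (U+00B6), while Ã¶ yields ö (U+00F6).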
