Python 3 : Converting UTF-8 unicode Hindi Literal to Unicode

Question

I have a string of UTF-8 literals

'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' which converts to

ही बोल in Hindi. I am unable to convert the string a to bytes.

a = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
#convert a to bytes
#also tried a = bytes(a,'utf-8')
a = a.encode('utf-8')
s = str(a,'utf-8')

The string is converted to bytes, but the resulting bytes are wrong.

RESULT : b'\xc3\xa0\xc2\xa4\xc2\xb9\xc3\xa0\xc2\xa5\xc2\x80 \xc3\xa0\xc2\xa4\xc2\xac\xc3\xa0\xc2\xa5\xc2\x8b\xc3\xa0\xc2\xa4\xc2\xb2' which prints - à¤¹à¥ à¤¬à¥à¤²

EXPECTED : It should be b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2', which decodes to ही बोल

What are you trying to achieve? You have bytes (a UTF-8 encoded string). What do you want to do with it? Is `b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'.decode('utf8')` what you are looking for? — Codo, Dec 14 '19 at 13:22
That is the wrong bytes string. It should be b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' — Mohit Kumar, Dec 14 '19 at 13:23
So your starting point is the string "ही बोल"? If so you might be looking for `'ही बोल'.encode('utf-8')`. — Codo, Dec 14 '19 at 13:33
'ही बोल' is not the starting point. The starting point is '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'; ही बोल is the expected end result. — Mohit Kumar, Dec 14 '19 at 13:37
I don't get it. Your starting point and your result seem to be the same. If so, no processing is needed. It might be helpful if you provided the bigger context: Where does the data come from and in what format? Where does it need to go and in what format? — Codo, Dec 14 '19 at 13:41
I get this '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' and want to get ही बोल — Mohit Kumar, Dec 14 '19 at 13:44

snakecharmerb · Accepted Answer · 2019-12-14T14:13:57.180

1

Encode the string to bytes with the raw-unicode-escape codec, then decode those bytes as UTF-8.

>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('raw-unicode-escape').decode('utf-8')
'ही बोल'

This is something of a workaround; the ideal solution would be to prevent the source of the data stringifying the original bytes.
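
As a rough sketch of the two routes, assuming the original data really was the UTF-8 byte sequence shown in the question (the variable names `raw`, `fixed`, `original_bytes` and `direct` are illustrative and not from the original post):

```python
# The string from the question: every character is a code point below U+0100
# that stands in for one byte of the original UTF-8 data.
s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# Workaround: raw-unicode-escape writes code points U+0000-U+00FF out as
# single bytes, which recovers the original UTF-8 byte sequence.
raw = s.encode('raw-unicode-escape')
fixed = raw.decode('utf-8')   # 'ही बोल'

# Ideal route (assumption: the source can hand over the bytes untouched):
# decode the bytes as UTF-8 directly, with no intermediate str.
original_bytes = b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
direct = original_bytes.decode('utf-8')

# Both routes should agree.
assert raw == original_bytes
assert fixed == direct
```

One caveat with the workaround: for code points above U+00FF, raw-unicode-escape emits `\uXXXX` escape sequences rather than raising an error, so it is probably best applied only to strings known to contain stringified bytes.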

edited Dec 14 '19 at 14:13

answered Dec 14 '19 at 13:47


score 1 · Answer 2 · answered Dec 15 '19 at 02:17

Your original string was likely decoded as latin1. Decode it as UTF-8 instead if possible; if you receive it already garbled, you can reverse the damage by encoding it as latin1 again and then decoding it correctly as UTF-8:

>>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
>>> s.encode('latin1').decode('utf8')
'ही बोल'

Note that latin1 encoding matches the first 256 Unicode code points, so U+00E0 ('\xe0' in a Python 3 str object) becomes byte E0h (b'\xe0' in a Python 3 bytes object). It's a 1:1 mapping between U+0000-U+00FF and bytes 00h-FFh.
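
The 1:1 mapping can be checked directly, and it also explains the wrong bytes in the question: encoding the same string as UTF-8 turns U+00E0 into the two bytes C3h A0h. The sketch below illustrates this; `repair_mojibake` is a hypothetical helper name, not a standard-library function, and its fall-back behaviour (returning the input unchanged) is an assumption about what a caller would want.

```python
s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# latin1 maps each code point U+0000-U+00FF to the byte with the same value.
assert s[:3].encode('latin1') == b'\xe0\xa4\xb9'

# UTF-8 encodes U+00E0, U+00A4 and U+00B9 as two bytes each, which is the
# "wrong" result shown in the question.
assert s[:3].encode('utf-8') == b'\xc3\xa0\xc2\xa4\xc2\xb9'


def repair_mojibake(text):
    """Hypothetical helper: undo a latin1 mis-decoding of UTF-8 data.

    Returns the text unchanged if it cannot be such a mis-decoding.
    """
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF, or the bytes are
        # not valid UTF-8; in both cases leave the input alone.
        return text


repaired = repair_mojibake(s)  # 'ही बोल'
```

Unlike raw-unicode-escape, the latin1 codec raises `UnicodeEncodeError` for any code point above U+00FF, so text that is already correct (such as 'ही बोल' itself) passes through this helper unchanged.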
