Error while decoding HTML and Farsi from hex encoding in Python

Question

I have some strings in hex encoding, like this:

data = '\xd8\xa7\xdb\x8c \xd9\x84\xda\x86\xdb\x8c<br/> \xd8\xa7\xda\xaf\xd8\xb1\xda\x86\xd9\x87 \xd8\xa7\xd9\x82\xd8\xaf\xd8\xa7\xd9\x85\xd8\xa7\xd8\xaa'

It contains some Persian text and some HTML elements.

Using ddcode.com I can convert them and get meaningful results (I'm not sure the string is actually hex!), but when I try to decode the strings in Python I always get errors.

Using codecs: codecs.decode(data,'hex',errors='ignore')

I get

AssertionError                            Traceback (most recent call last)
<ipython-input-124-5246163fba41> in <module>()
----> 1 codecs.decode(data,'hex',errors='ignore')

AssertionError: decoding with 'hex' codec failed (AssertionError: )

Using binascii: binascii.unhexlify(data)

I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-126-fbe8c6445b8a> in <module>()
      1 import binascii
----> 2 binascii.unhexlify(data)

ValueError: string argument should contain only ASCII characters.

What do you suggest? Is the string in hex? If there are some non-hex characters in the string, how can I ignore them during decoding?

score 0 · Answer 1 · answered Apr 07 '17 at 07:45

is the string in hex?

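No, it's not hex. It's text whose UTF-8 bytes were mis-decoded one character per byte, as if they were Latin-1. Encode it back to bytes with Latin-1, then decode those bytes as UTF-8: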

>>> 'data  \xd8\xa7\xdb\x8c \xd9\x84\xda\x86\xdb\x8c<br/> \xd8\xa7\xda\xaf\xd8\xb1\xda\x86\xd9\x87 \xd8\xa7\xd9\x82\xd8\xaf\xd8\xa7\xd9\x85\xd8\xa7\xd8\xaa'.encode('latin-1').decode('utf-8')
'data  ای لچی<br/> اگرچه اقدامات'
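To make the round trip above easier to follow, here is a minimal sketch that applies it to a single word from `data` and looks at the intermediate bytes. It assumes Python 3; the variable names are only for illustration.

```python
# One word from the question's data: ten characters, each in U+0080..U+00FF.
mojibake = '\xd8\xa7\xda\xaf\xd8\xb1\xda\x86\xd9\x87'

# Latin-1 maps every code point 0..255 to the byte with the same value,
# so this step recovers the original bytes without changing them.
raw = mojibake.encode('latin-1')

# Those bytes are valid UTF-8: five two-byte sequences, one per Persian letter.
text = raw.decode('utf-8')

print(len(mojibake), len(raw), len(text))
print(text)
```

The `'hex'` codec and `binascii.unhexlify` expect hex digits such as `d8a7`, not `\xd8\xa7` escapes, which is why both calls in the question fail.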

Ignacio Vazquez-Abrams

Your code works for `data`, but it doesn't work for some other strings! For example: `('\\r\\n \\xd8\\xa7\\xd8\\xb5\\xd9\\x84 44 \\xd9\\xbe\\xd8\\xb3 \\xd8\\xa7\\xd8\\xb2 21 \\xd9\\x85\\xd8\\xa7\\xd9\\x87 \\r\\n ')`. Is something wrong with the latter? Why is it broken? – Mehdi Apr 07 '17 at 07:56
That text is *further* encoded incorrectly. Decode with `unicode-escape`. `>>> '\\r\\n \\xd8\\xa7\\xd8\\xb5\\xd9\\x84 44 \\xd9\\xbe\\xd8\\xb3 \\xd8\\xa7\\xd8\\xb2 21 \\xd9\\x85\\xd8\\xa7\\xd9\\x87 \\r\\n '.encode('latin-1').decode('unicode-escape').encode('latin-1').decode('utf-8')` `'\r\n اصل 44 پس از 21 ماه \r\n '` – Ignacio Vazquez-Abrams Apr 07 '17 at 07:58
It seems that your code works perfectly! Can you explain how it works? – Mehdi Apr 07 '17 at 09:05
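For reference, the chain from the comment above can be split into its four steps. This is only a sketch, assuming Python 3 and assuming the input holds literal backslash escapes (the characters `\`, `x`, `d`, `8`, ...) and no characters outside Latin-1; the function name `fix_escaped_mojibake` is made up for illustration.

```python
def fix_escaped_mojibake(s):
    """Undo literal backslash escapes, then undo the Latin-1/UTF-8 mix-up."""
    # 1. str -> bytes: the escapes are plain ASCII, so nothing changes yet.
    escaped_bytes = s.encode('latin-1')
    # 2. Interpret the escapes: the four characters \xd8 become U+00D8, etc.
    #    The result has the same shape as `data` in the question.
    mojibake = escaped_bytes.decode('unicode-escape')
    # 3. Map each character back to the byte with the same value.
    raw = mojibake.encode('latin-1')
    # 4. Decode the real UTF-8 bytes.
    return raw.decode('utf-8')


# Raw string literal: the backslashes are kept as literal characters.
sample = r'\r\n \xd8\xa7\xd8\xb5\xd9\x84 44 \r\n '
fixed = fix_escaped_mojibake(sample)
print(repr(fixed))
```

Steps 3 and 4 are the same repair as in the answer; steps 1 and 2 only exist because this second string has its escapes stored as text.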
