UTF-8 decoding with Python

Question

I have a CSV file with some data, and one row contains text that was written to the file after being encoded as UTF-8.

This is the text:

"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"

I'm trying to recover the original characters from this text with the decode function, but I can't get it to work.

Does anyone know the correct way to do this?

score 4 · Accepted Answer · answered Feb 21 '18 at 10:48

Assuming that the line in your file is exactly like this:

b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
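One plausible way a file ends up with a line like this (an assumption, since the question does not show the code that wrote the CSV) is that the text was encoded and the resulting bytes object was then converted to a string before being written. A minimal sketch using only the first two characters of the address:

```python
# Assumption: the writer encoded the text, then stringified the bytes object.
text = "\u7533\u8fea"            # the first two characters of the address
encoded = text.encode("utf-8")   # a real bytes object: b'\xe7\x94\xb3\xe8\xbf\xaa'

# str() of a bytes object yields the *text* of the literal, backslashes included.
cell = str(encoded)

print(type(cell).__name__)       # the cell is text, not bytes
print(cell)
```

The value written to the file is therefore ordinary text that merely looks like a bytes literal, which is why it cannot be decoded directly after reading it back.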

And reading the line from the file gives the output:

>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"

You can use the eval() function to turn that text back into a bytes object and then decode it:

with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # when you reach the desired line
        b = eval(line).decode('utf-8')

Output:

>>> print(b)
申迪西路255弄660号和665号 中国上海浦东新区 201205
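Because eval() executes whatever expression the file happens to contain, a more cautious variant is `ast.literal_eval()` from the standard library, which only accepts Python literals. A minimal sketch, assuming the cell holds exactly the text of a bytes literal; the helper name `decode_bytes_literal` is hypothetical:

```python
import ast


def decode_bytes_literal(text, encoding="utf-8"):
    """Turn the text of a bytes literal (e.g. "b'\\xe7\\x94\\xb3'") into a str."""
    value = ast.literal_eval(text.strip())  # parses literals only, runs no code
    if not isinstance(value, bytes):
        raise ValueError("expected the text of a bytes literal")
    return value.decode(encoding)


# Shortened sample: the first two characters of the address in the question.
sample = r"b'\xe7\x94\xb3\xe8\xbf\xaa'"
print(decode_bytes_literal(sample))
```

If the cell contains anything other than a bytes literal, the helper raises instead of silently evaluating it.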

The file contains b'\xe7\x94\xb3\xe8\...' and when I read and print it I get b'\xe7\x94\xb3\xe8' – Madmartigan, Feb 21 '18 at 11:57
Can you show what the actual file looks like? Maybe from an editor like Notepad++? – abybaddi009, Feb 21 '18 at 11:58
@Madmartigan that is exactly what this answer means by using `eval()`. Did you try it? – Edwin van Mierlo, Feb 21 '18 at 11:59

Narendra · Answer 2 · 2018-02-21T10:30:40.330

0

Try this:

a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8'))  # your decoded output

Since you say you are reading from a file, you can pass the encoding when opening it:

import codecs

# Open the file with an explicit encoding so each line is decoded as UTF-8.
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

edited Feb 21 '18 at 10:30

answered Feb 21 '18 at 10:00


I know that works. My problem is that I cannot find a way to prepare the string. When I read the row I get "b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\..." but I need b'\xe7\x94\xb3\xe8\xbf\xaa\xe8...' – Madmartigan Feb 21 '18 at 10:05
@Madmartigan OK, in that case I have modified my answer. Try it. – Narendra Feb 21 '18 at 10:29
@Narendra OP is asking about python-3. It's enough to use `open(path, 'r', encoding='utf-8')`. You don't have to use the codecs module. – viraptor Feb 21 '18 at 10:41
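
A minimal sketch of that suggestion, using a temporary file as a stand-in for the real CSV (the file and its contents are assumptions for illustration). Note that the `encoding` argument only controls how the file's own bytes become text; if the file holds the escaped text of a bytes literal, that text still has to be parsed separately.

```python
import os
import tempfile

# Write a small UTF-8 file to stand in for the real CSV (hypothetical content).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".csv",
                                 delete=False) as tmp:
    tmp.write("申迪西路,201205\n")
    path = tmp.name

# The built-in open() decodes the file when given an explicit encoding.
with open(path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
print(lines)
```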
