Representing non-English characters with Unicode (UTF-8)

Question

I am working with an HTML string in Python that contains non-English characters, which are represented in the string by 16-bit Unicode hex escapes. The string reads:

"Skr\u00E4ddarev\u00E4gen"

The string, when properly converted, should read "Skräddarevägen". How do I ensure that the Unicode escapes are correctly decoded on output and display with the correct accents?

(Note: I'm using Requests and Pandas, and the encoding in both is set to utf-8.) Thanks in advance!

score 4 · Answer 1 · answered Aug 09 '19 at 20:04

In Python 3, there are two cases to consider:

If you read the string from an HTML file, you have to open that file with the correct encoding.
If the string is a literal in Python 3 source code, it is already a Unicode string (a sequence of code points) in memory.

When you write the string out to a file, specify the encoding you want in the open() call.
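
As a minimal sketch of the file round trip described above: the file name and the temporary directory are assumptions made only for illustration, and UTF-8 is assumed as the target encoding.

```python
import os
import tempfile

text = "Skr\u00E4ddarev\u00E4gen"  # in Python 3 source, \u00E4 is already ä

# Hypothetical file name in a throwaway directory, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "street.html")

# Write with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the text back using the same encoding.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# Look at the raw bytes on disk: each ä is stored as the two bytes C3 A4.
with open(path, "rb") as f:
    raw = f.read()

print(round_tripped)
print(raw)
```

Passing encoding= explicitly matters because the default depends on the platform and locale, so the same script can otherwise produce different bytes on different machines.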

朱梅寧

This seems to be so automatic and hands-free in Python 3. We are still using Python 2.7 and I'll try the same. – Li Li Aug 11 '19 at 00:41

score 0 · Answer 2 · answered Jan 02 '18 at 23:15

From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.

s = "Skr\\u00E4ddarev\\u00E4gen"
print(len(s))
for c in s: print(c, end=' ')
print()
print(eval("'"+s+"'"))
print(eval("'"+s+"'").encode('utf-8'))

This prints

24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n 
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
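
Note that eval executes whatever the string contains, so it is risky on text that comes from a web page. One possible alternative, sketched here under the assumption that the text contains no unescaped double quotes or control characters, is to let the standard json module interpret the \uXXXX escapes:

```python
import json

s = "Skr\\u00E4ddarev\\u00E4gen"  # 24 characters, with literal backslashes

# Wrap the text in double quotes so it forms a JSON string literal,
# then let the JSON parser interpret the \uXXXX escapes.
# Assumption: s holds no unescaped '"' and no invalid escape sequences.
decoded = json.loads('"' + s + '"')

print(len(s))
print(decoded)
print(decoded.encode("utf-8"))
```

Unlike eval, json.loads only parses data and never runs code, and it raises an exception if the input is not a valid JSON string.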

score 0 · Accepted Answer · answered Jan 03 '18 at 03:15

If you are using Python 3 and that is literally the content of the string, it "just works":

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'

If the string instead contains literal backslash escapes (as it would when it arrives as raw data), you have to decode it. If it is a Unicode string (str), encode it to bytes first; if you already have a byte string, skip the encode step. The final result is a Unicode string.

>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
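
One caveat: .encode('ascii') raises UnicodeEncodeError if the text already contains real non-ASCII characters next to the escapes. A rough sketch for that mixed case follows. The helper name decode_u_escapes is hypothetical, and it only handles four-digit \uXXXX escapes; surrogate pairs for characters outside the Basic Multilingual Plane are not recombined.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences; leave every other character alone."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


mixed = r"Skr\u00E4ddarevägen"  # one escaped ä and one real ä
print(decode_u_escapes(mixed))
```

For plain ASCII input with escapes, the .encode('ascii').decode('unicode_escape') approach from this answer is shorter and gives the same result.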

If you are on Python 2, you'll need to decode, plus print to see it properly:

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen

The .encode('ascii').decode('unicode_escape') appears to have solved the issue. Thanks for your help! — George Mathias, Jan 03 '18 at 22:33
@George Welcome to SO! If an answer is useful, please upvote it, and if it is the best answer, select the check to the left to accept it as the answer. — Mark Tolonen, Jan 04 '18 at 00:01
