
I am working with an HTML string in Python that contains non-English characters, which are represented in the string by 16-bit Unicode hex escapes. The string reads:

"Skr\u00E4ddarev\u00E4gen"

When properly converted, the string should read "Skräddarevägen". How do I ensure that the Unicode hex escapes are decoded correctly and that the output shows the correct accented characters?

(Note: I'm using Requests and Pandas, and the encoding in both is set to utf-8.) Thanks in advance!

3 Answers


In Python 3, there are two cases to consider:

  1. If you pick up your string from an HTML file, you have to read the HTML file using the correct encoding.
  2. If you have your string in Python 3 code, it is already a Unicode string (a sequence of code points) in memory.

When you write the string out to a file, you have to specify the encoding you want in the call to open.
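As a minimal sketch of the read/write points above, the round trip below writes the string with an explicit encoding and reads it back. The file name and temporary directory are assumptions made up for the illustration, not something from the question.

```python
import os
import tempfile

# In source code the \u escapes are resolved by Python itself.
text = "Skr\u00E4ddarev\u00E4gen"

# Hypothetical path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "street.html")

# Write with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read back with the same encoding to get the same Unicode string.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Read the raw bytes to see how the text is stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(round_trip)
print(raw)
```

If the encoding used for reading does not match the one used for writing, the accented characters will typically come back garbled or raise a decoding error.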

朱梅寧
  • This seems to be so automatic and hand-free in Python 3. We are still using Python 2.7 and I'll try the same. – Li Li Aug 11 '19 at 00:41

From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.

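s = "Skr\\u00E4ddarev\\u00E4gen"  # the backslashes are literal characters here
print(len(s))
for c in s: print(c, end=' ')
print()
# eval interprets the escapes; only use it on input you trust
print(eval("'"+s+"'"))
print(eval("'"+s+"'").encode('utf-8'))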

This prints

24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n 
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
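Since eval will run any expression it is given, a more cautious variant of the same idea is ast.literal_eval, which only accepts Python literals. This is a sketch under the same assumption as above (the string holds the 24 literal characters); like the eval version, it would still fail if the text itself contained a single quote.

```python
import ast

s = "Skr\\u00E4ddarev\\u00E4gen"  # 24 characters, backslashes are literal

# Wrap the text in quotes so it parses as a string literal.
# literal_eval rejects anything that is not a plain Python literal.
decoded = ast.literal_eval("'" + s + "'")

print(decoded)
print(decoded.encode('utf-8'))
```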
Terry Jan Reedy

If you are using Python 3 and that is literally the content of the string, it "just works":

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'

If the string instead contains the literal backslash sequences (for example, raw data read from a file or a response), you have to decode it. If it is a Unicode string (str), encode it to bytes first; if you already have a byte string, skip the encode step. Either way, the final result is a Unicode string.

>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
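One caveat with the encode('ascii') step: it raises UnicodeEncodeError if the text already contains real non-ASCII characters alongside the escapes. A possible workaround, sketched here with a hypothetical helper name (decode_u_escapes is not part of any library), is to replace only the backslash-u sequences with a regular expression. The sketch does not combine surrogate pairs, so it assumes all escapes are in the Basic Multilingual Plane.

```python
import re

def decode_u_escapes(text):
    """Replace each literal backslash-u escape (4 hex digits) with its character."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# One escaped 'ä' and one real 'ä' in the same string.
mixed = "Skr\\u00E4ddarevägen"
result = decode_u_escapes(mixed)

print(result)
```

Characters that are already decoded pass through untouched, so the helper is safe to apply to text that is only partly escaped.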

If you are on Python 2, you'll need to decode, plus print to see it properly:

>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen
Mark Tolonen
  • The .encode('ascii').decode('unicode_escape') appears to have solved the issue. Thanks for your help! – George Mathias Jan 03 '18 at 22:33
  • @George Welcome to SO! If an answer is useful, please upvote it, and if it is the best answer, select the check to the left to accept it as the answer. – Mark Tolonen Jan 04 '18 at 00:01