Removing XML Unicode characters from strings

Question

I'm struggling to remove XML-escaped Unicode characters (numeric character references such as &#x421;) from strings. Adapting this solution for Python 3 fails:

s = 'foo&#x421;&#x44A;&#x431;bar'
s.encode('ascii', errors='ignore')
# b'foo&#x421;&#x44A;&#x431;bar'

I've also tried unescaping with xml.sax.saxutils but with no luck:

from xml.sax.saxutils import unescape
unescape(s).encode('ascii', errors='ignore')
# b'foo&#x421;&#x44A;&#x431;bar'

Any suggestions appreciated.

Do you want to completely remove them, or just translate them correctly? `print(html.unescape(s))` gives `fooСъбbar`. – Mark Tolonen, Apr 01 '21 at 23:37

score 1 · Accepted Answer · answered Apr 01 '21 at 12:02


You can use html.unescape for this task:

import html
s = 'foo&#x421;&#x44A;&#x431;bar'
s2 = html.unescape(s).encode('ascii', errors='ignore')
print(s2)

output:

b'foobar'


Daweo

What if the XML escape represents an ASCII character? It will be replaced by that character, not removed. – Mark Tolonen Apr 01 '21 at 23:38
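If the goal is to drop every numeric character reference regardless of which character it encodes, one possible approach is to delete the references with a regular expression instead of decoding them first. The following is only a sketch under that assumption; the names `NUMERIC_REF` and `strip_numeric_refs` are hypothetical and not part of the answer above, and named entities such as `&amp;` are deliberately left untouched.

```python
import re

# Matches numeric character references only:
# decimal (&#1057;) or hexadecimal (&#x421;). Hypothetical helper name.
NUMERIC_REF = re.compile(r'&#(?:[xX][0-9a-fA-F]+|[0-9]+);')

def strip_numeric_refs(text):
    """Delete every numeric character reference without decoding it."""
    return NUMERIC_REF.sub('', text)

print(strip_numeric_refs('foo&#x421;&#x44A;&#x431;bar'))  # foobar
print(strip_numeric_refs('a&#x41;b &amp; c'))             # ab &amp; c
```

By contrast, `html.unescape('a&#x41;b').encode('ascii', errors='ignore')` returns `b'aAb'`, because `&#x41;` decodes to the ASCII letter `A` and therefore survives the encoding step.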
