I'm struggling to remove XML Unicode character references from strings. Adapting this solution for Python 3 fails:

s = 'foo&#1057;&#1098;&#1073;bar'
s.encode('ascii', errors='ignore')
# b'foo&#1057;&#1098;&#1073;bar'

I've also tried unescaping with xml.sax.saxutils but with no luck:

from xml.sax.saxutils import unescape
unescape(s).encode('ascii', errors='ignore')
# b'foo&#1057;&#1098;&#1073;bar'

Any suggestions appreciated.

geotheory
  • Do you want to completely remove them, or just translate them correctly? `print(html.unescape(s))` gives `fooСъбbar`. – Mark Tolonen Apr 01 '21 at 23:37

1 Answer

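You can use html.unescape for this task. It decodes the character references into the actual characters, which encode('ascii', errors='ignore') then drops: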

import html
s = 'foo&#1057;&#1098;&#1073;bar'
s2 = html.unescape(s).encode('ascii', errors='ignore')
print(s2)

output:

b'foobar'
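
If you want a str back rather than bytes, or want to drop the references without decoding them at all, something like the following might work. This is only a sketch: the helper names strip_char_refs and unescape_to_ascii are made up for illustration, and the regex assumes the references are numeric (decimal like &#1057; or hex like &#x421;), so named entities such as &amp; are left alone by the first helper.

```python
import html
import re

# Hypothetical helper: delete numeric character references without decoding them.
# Matches decimal (&#1057;) and hexadecimal (&#x421;) forms only.
def strip_char_refs(text: str) -> str:
    return re.sub(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+);', '', text)

# Hypothetical helper: decode all references first, then drop every
# non-ASCII character and return a str instead of bytes.
def unescape_to_ascii(text: str) -> str:
    return html.unescape(text).encode('ascii', errors='ignore').decode('ascii')

s = 'foo&#1057;&#1098;&#1073;bar'
print(strip_char_refs(s))                         # foobar
print(unescape_to_ascii(s))                       # foobar
print(strip_char_refs('caf&#233; &amp; bar'))     # caf &amp; bar
print(unescape_to_ascii('caf&#233; &amp; bar'))   # caf & bar
```

The two differ on mixed input: the regex version only touches numeric references, while the unescape version also decodes named entities and removes any non-ASCII character that was already in the string.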
Daweo