I've got some data stored as strings that contain both Unicode characters (e.g., ñ) and Unicode escape sequences (e.g., \u00F1). I would like a string-to-string transformation that converts the escape sequences into the corresponding Unicode characters while leaving everything else alone. For example, if the string is s = r'\u00F1ñ', the output should be 'ññ'.

The closest I've found so far is s.encode().decode('unicode-escape'): this converts the escape sequences, but mangles any non-ASCII Unicode characters already present.
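To make the problem concrete, here is a minimal reproduction of the mangling. The variable names are only for illustration; the explanation in the comments assumes that `str.encode()` defaults to UTF-8 and that the `unicode-escape` codec treats each non-escaped byte as a Latin-1 character.

```python
# A six-character escape sequence followed by a literal ñ (7 characters in total).
s = r'\u00F1ñ'

# encode() defaults to UTF-8, so the literal ñ becomes the two bytes 0xC3 0xB1.
# 'unicode-escape' then decodes each of those bytes as its own character.
mangled = s.encode().decode('unicode-escape')

print(repr(mangled))
```

The escape sequence is converted as intended, but the literal ñ comes back as two separate characters.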

Please note that this question is for Python 3.

Jolyon
  • For me, making it a regular string and using the default `utf8` encoding and decoding got the desired result. – duckboycool Aug 06 '20 at 01:52
  • Post code snippet? I've tried doing similar things... – Jolyon Aug 06 '20 at 01:55
  • Note that the r in front of the string is mandatory to get the correct representation of the data that I have. – Jolyon Aug 06 '20 at 01:55
  • If I'm not mistaken, the "unicode-escape" codec assumes Latin-1 for non-escaped characters. So you could try `s.encode('latin-1').decode('unicode-escape')`. Note: this will only work if your non-escaped characters have a codepoint below 256. – lenz Aug 06 '20 at 14:52
  • @lenz Indeed, that does it! `r'\u00F1ñ'.encode('latin-1').decode('unicode-escape')` returns `'ññ'` as desired. Feel free to make an answer to this effect so I can accept it. – Jolyon Aug 06 '20 at 17:47
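
For reference, the approach from the comments can be sketched as follows, together with one possible fallback for strings whose literal characters fall outside Latin-1. The function names `unescape_latin1` and `unescape_u4` are hypothetical, and the regex fallback is an alternative technique rather than anything proposed in the thread. It assumes only four-digit `\uXXXX` escapes need converting: it does not handle `\UXXXXXXXX`, `\xNN`, `\n`, escaped backslashes, or surrogate pairs.

```python
import re


def unescape_latin1(s):
    """Comment-thread approach: only works if every literal character is below U+0100."""
    return s.encode('latin-1').decode('unicode-escape')


# Matches a backslash, 'u', and exactly four hex digits.
_U4_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_u4(s):
    """Fallback sketch: replace each \\uXXXX escape and leave all other text untouched."""
    return _U4_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


print(unescape_latin1(r'\u00F1ñ'))  # literal ñ is U+00F1, so Latin-1 can encode it
print(unescape_u4(r'\u00F1€'))      # € is U+20AC, which Latin-1 cannot encode
```

Calling `unescape_latin1` on a string containing a character such as € raises `UnicodeEncodeError`, which is the limitation lenz points out.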

0 Answers