
Python 3 changed its Unicode handling so that the UTF-8 codec rejects surrogates, while Python 2 still accepts them.

There's a related question here, but it does not provide a solution for removing surrogates in Python 2 or for doing surrogate escaping.

Python 3 example:

>>> a = b'\xed\xa0\xbd\xe4\xbd\xa0\xe5\xa5\xbd'
>>> a.decode('utf-8', 'surrogateescape')
'\udced\udca0\udcbd你好'
>>> a.decode('utf-8', 'ignore')
'你好'

The bytes '\xed\xa0\xbd' here are not valid UTF-8 (they are the encoding of the lone surrogate U+D83D), and I want to either ignore them or escape them.

Is it possible to do the same thing in Python 2?

lxyu
  • What exactly do you want to do? It is not clear. Provide an example. – Mark Tolonen Oct 29 '13 at 04:44
  • @MarkTolonen I have added an example. – lxyu Oct 29 '13 at 07:32
  • I don't see a better way than post-processing the decoded unicode object to remove all characters between '\udc00' and '\udfff'. – Armin Rigo Oct 29 '13 at 08:27
  • @ArminRigo do you have any reference for '\udc00' and '\udfff'? Why are they the boundaries? – lxyu Oct 31 '13 at 18:49
  • They are the "low surrogates" (the full surrogate range runs from '\ud800' to '\udfff'). See anywhere, e.g. on Wikipedia. I can only side with Mark: it's not clear what you want to do. Do you want equivalent code to do the same as Python 3's decode('utf-8', 'surrogateescape') and decode('utf-8', 'ignore'), but in Python 2? – Armin Rigo Oct 31 '13 at 21:55
  • @ArminRigo yes, it is. – lxyu Nov 04 '13 at 06:10
  • I'm afraid there is no built-in solution. You need to write a function that loops over each character (say, of the resulting unicode object), checks which ones are surrogates, and replaces them as needed to emulate the behavior you want. – Armin Rigo Nov 04 '13 at 16:55
  • @lxyu did you find an answer on how to do this? – underrun Apr 21 '14 at 13:39
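
The post-processing approach suggested in the comments could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name strip_surrogates is hypothetical, and the narrow-build branch assumes a Python 2 interpreter that stores non-BMP characters as surrogate pairs, in which case only unpaired surrogates should be removed.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build (and Python 3.3+): non-BMP characters are single code
    # points, so anything in U+D800..U+DFFF is a stray surrogate.
    _SURROGATES = re.compile(u'[\ud800-\udfff]')
else:
    # Narrow Python 2 build (assumption): valid non-BMP characters are
    # stored as high+low pairs, so match only unpaired surrogates.
    _SURROGATES = re.compile(
        u'[\ud800-\udbff](?![\udc00-\udfff])'
        u'|(?<![\ud800-\udbff])[\udc00-\udfff]'
    )


def strip_surrogates(text):
    """Return text with surrogate code points removed (hypothetical helper)."""
    return _SURROGATES.sub(u'', text)


# u'\ud83d' is the surrogate that '\xed\xa0\xbd' corresponds to.
cleaned = strip_surrogates(u'\ud83d\u4f60\u597d')
```

In Python 2, calling strip_surrogates(a.decode('utf-8')) on the bytes from the question should then give the same u'你好' that Python 3 produces with the 'ignore' handler, assuming the Python 2 decoder accepts those bytes as a surrogate rather than raising.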

1 Answer


There is no built-in solution, but there is an implementation of the surrogateescape error handler in python-future: https://github.com/PythonCharmers/python-future

Add from future.utils.surrogateescape import register_surrogateescape to your imports, then call register_surrogateescape() once. After that, you can pass errors='surrogateescape' to encode and decode.

An example can be found here
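
A minimal sketch of those steps, assuming python-future is installed when running under Python 2. The version guard is my own addition so that the snippet also runs on Python 3, where the handler is built in. The input uses b'\xff', a byte that is invalid UTF-8 under both versions; as far as I know, Python 2's decoder accepts the '\xed\xa0\xbd' bytes from the question without raising, so the error handler may never be invoked for them there.

```python
import sys

if sys.version_info[0] == 2:
    # Python 2 has no 'surrogateescape' handler; python-future provides one.
    from future.utils.surrogateescape import register_surrogateescape
    register_surrogateescape()

# b'\xff' can never appear in valid UTF-8, so the error handler is invoked.
raw = b'\xff\xe4\xbd\xa0\xe5\xa5\xbd'

# The undecodable byte 0xFF is escaped as the lone surrogate U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')
print(repr(text))
```

Whether encoding with errors='surrogateescape' restores the original bytes under Python 2 is something I have not verified, so it is worth testing the round trip before relying on it.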

proski