Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?

Question

How to replace characters that cannot be decoded using utf8 with whitespace?

# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # print out nothing
print unicode('ABC\x97abc', errors='ignore') # print out ABCabc

How can I print out ABC abc instead of ABCabc? Note, \x97 is just an example character. The characters that cannot be decoded are unknown inputs.

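If we use errors='ignore', the undecodable character is silently dropped, so nothing is printed in its place.
If we use errors='replace', the character is replaced with U+FFFD, the Unicode replacement character, rather than with whitespace.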

HelloWorld · Answer 1 · 2015-08-20T10:55:48.703

9

Take a look at codecs.register_error. You can use it to register custom error handlers for encoding and decoding:

https://docs.python.org/2/library/codecs.html#codecs.register_error

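import codecs

# The handler returns the replacement text and the index at which decoding resumes.
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')  # prints: ABC abc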
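
For readers on Python 3, the same idea can be sketched roughly as follows. This is a minimal sketch, not part of the original answer: it assumes the input is a bytes object, and the handler name replace_with_space is simply reused from the snippet above. It resumes at error.end rather than error.start + 1, so a whole undecodable sequence becomes a single space; that is a design choice, not a requirement.

```python
import codecs


def replace_with_space(error):
    # Only decoding errors are handled here; anything else is re-raised.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # Return the replacement text and the index at which decoding resumes.
    return (" ", error.end)


# Register the handler under a name usable as the `errors` argument.
codecs.register_error("replace_with_space", replace_with_space)

raw = b"ABC\x97abc"  # \x97 is not a valid UTF-8 start byte
text = raw.decode("utf-8", errors="replace_with_space")
print(text)  # ABC abc
```

Once registered, the handler name can be passed to any decode call in the same process, so the registration only needs to happen once.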

edited Aug 20 '15 at 10:55

answered Aug 20 '15 at 10:42

HelloWorld


Does Stack Overflow allow more than one accepted solution? Both @Kasramvd and you provide excellent answers. What should I do in this case? – DehengYe Aug 20 '15 at 11:14

Mazdak · Accepted Answer · 2015-08-20T10:40:19.567

3

You can use a try-except statement to handle the UnicodeDecodeError:

def my_encoder(my_string):
    for i in my_string:
        try:
            # unicode(i) decodes a single byte with the default (ASCII) codec
            yield unicode(i)
        except UnicodeDecodeError:
            yield '\t'  # or any other whitespace character

Then use the str.join method to join the pieces back into one string:

print ''.join(my_encoder(my_string))

Demo:

>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a   n exam  ple
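
Note that the generator above decodes one byte at a time with Python 2's default ASCII codec, so the bytes of valid multi-byte UTF-8 characters are also turned into the filler. A UTF-8-aware variant of the same try-except idea might look like the sketch below. It is written for Python 3 and is not from the original answer; the function name decode_with_filler and its filler parameter are hypothetical.

```python
def decode_with_filler(raw, filler=" "):
    """Decode UTF-8 bytes, substituting `filler` for each undecodable sequence."""
    parts = []
    pos = 0
    while pos < len(raw):
        try:
            # Try to decode everything that is left in one go.
            parts.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice raw[pos:].
            # Everything before exc.start is valid UTF-8, so keep it.
            parts.append(raw[pos:pos + exc.start].decode("utf-8"))
            parts.append(filler)
            # Skip the undecodable bytes and continue after them.
            pos += exc.end
    return "".join(parts)


print(decode_with_filler(b"this is a\x97n exam\x97ple"))  # this is a n exam ple
```

Because it slices and retries after every error, this sketch favors clarity over speed; for large inputs with many bad bytes, a registered error handler is likely the better fit.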

edited Aug 20 '15 at 10:40

answered Aug 20 '15 at 10:34

Mazdak


\x97 is just an example character. The characters that cannot be decoded are unknown inputs. – DehengYe Aug 20 '15 at 10:38
@DehengYe Just a typo, fixed – Mazdak Aug 20 '15 at 10:47
Very helpful answer! @Kasramvd – DehengYe Aug 20 '15 at 11:01
I hope you don't mind. Both you and @HelloWorld provide excellent answers. But Stack Overflow allows only one solution. – DehengYe Aug 20 '15 at 11:22
