Comprehensive character replacement module in python for non-unicode and non-ascii for HTML

Question

Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? The common habit of passing "ignore" when encoding or decoding is insane, but so is a '?' in every place where an untranslated character used to be.

I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested. I realize that the number of extant alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.

This would simplify data transfer enormously.

Can you give some concrete examples of what you wish to happen? What would the result look like? – Pablo H, Jan 27 '23 at 15:31

score 4 · Answer 1 · answered Oct 17 '12 at 22:50

I don't think what you want is really possible, but I think there is a decent option.

unicodedata has a 'normalize' function that can gracefully degrade text for you...

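import unicodedata

def gracefully_degrade_to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding with 'ignore' then drops whatever is still non-ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')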

Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII or UTF-8 with this module (it's part of the standard library too).

Full Docs - http://docs.python.org/library/unicodedata.html
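To make the trade-off concrete, here is a minimal sketch of what the NFKD approach above keeps and what it silently drops. The function name `degrade_to_ascii` and the sample strings are illustrative assumptions, not part of the original answer.

```python
import unicodedata

def degrade_to_ascii(text: str) -> str:
    """Decompose with NFKD, then drop every character that is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Illustrative samples (written with escapes so the source stays pure ASCII).
samples = [
    "caf\u00e9",      # e with acute: decomposes to 'e' + combining accent
    "\ufb01ne",       # 'fi' ligature: compatibility-decomposes to 'f' + 'i'
    "Stra\u00dfe",    # sharp s: has no decomposition, so it is dropped
    "\u00c6r\u00f8",  # AE and o-with-stroke: no decomposition, both dropped
]

for s in samples:
    print(ascii(s), "->", degrade_to_ascii(s))
```

Accented Latin letters and ligatures degrade well, but characters without a Unicode decomposition simply vanish, so this is a partial answer to the "ignore" complaint in the question rather than a full transliteration.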

score 0 · Answer 2 · answered Oct 11 '12 at 00:06

To look at any individual character and guess its encoding would be hard and probably not very accurate. However, you can use chardet to try to detect the encoding of an entire file. Then you can decode() the bytes with the detected encoding and encode() the resulting text as UTF-8.

http://pypi.python.org/pypi/chardet

And UTF-8 is backwards compatible with ASCII so that won't be a big deal.
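The decode-then-encode step might look roughly like the sketch below. It assumes the source encoding is already known or guessed (for example from `chardet.detect(raw)["encoding"]`); the helper name `reencode_to_utf8` is hypothetical, and chardet itself is not imported so the snippet stays standard-library only.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known (or guessed) encoding and re-encode as UTF-8.

    source_encoding is an assumption supplied by the caller, e.g. the
    'encoding' value from chardet.detect(raw).
    """
    # errors="replace" marks undecodable bytes with U+FFFD instead of raising.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")

latin1_bytes = b"caf\xe9"  # 'cafe' with e-acute, encoded as Latin-1
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)

# Pure ASCII input is unchanged, since ASCII is a subset of UTF-8.
print(reencode_to_utf8(b"plain ascii", "ascii"))
```

If the guessed encoding is wrong, the decode may still succeed and produce the wrong characters, so the result is only as good as the detection.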

JBoyer


Why is it hard to look at any individual character and guess its encoding? – x - y Oct 12 '12 at 17:34
@x-y because the same codes can be used for totally different languages. With more info it may be easier though. – Luke Stanley Dec 19 '13 at 03:10
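A small illustration of the point in the comment above: the same single byte decodes to a different character under each legacy encoding, so one byte alone carries no hint about which encoding produced it. The three codecs are chosen arbitrarily for the example.

```python
import unicodedata

raw = b"\xe9"  # one byte, no context

# Decode the same byte under three single-byte encodings.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "cp1251", "iso8859_7")}

for enc, ch in decoded.items():
    # Print the Unicode name rather than the glyph to stay console-safe.
    print(enc, "->", "U+%04X" % ord(ch), unicodedata.name(ch))
```

All three decodes succeed without error, which is why detection tools look at statistics over a whole file rather than at individual characters.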
