Identify garbage unicode string using python

Question

My script is reads data from csv file, the csv file can have multiple strings of English or non English words.

Some time the text file has garbage strings , i want to identify those string and skip those string and process others

doc = codecs.open(input_text_file, "rb",'utf_8_sig')
fob = csv.DictReader(doc)
for row, entry in enumerate(f):
    if is_valid_unicode_str(row['Name']):
         process_futher

def is_valid_unicode_str(value):
     try:
         function
         return True
     except UnicodeEncodeError:
         return false

csv input:

"Name"
"Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€"
"元大寶來證券"
"John Dove"

I want to defile function is_valid_unicode_str() which will identify the garbage string and process valid one only.

I tried to use decode is but it doesnt failed while decoding garbage strings

value.decode('utf8')

The expected output are string with Chinese and English string to be process

could you please guide me how can i implement function to filter valid Unicode files?.

You have [**Mojibake** strings](https://en.wikipedia.org/wiki/Mojibake). Data that has been, at some point encoded in one encoding and decoded into another. `codecs.open()` will have already decoded these to Unicode strings because your file has been uniformly encoded to UTF-8. — Martijn Pieters, Mar 16 '15 at 08:01
There is a possibility the Mojibake can be *repaired*, rather than discarded. Take a look at [`ftfy`](https://pypi.python.org/pypi/ftfy) to see what it can do to make sense of the strings you have. — Martijn Pieters, Mar 16 '15 at 08:04
I can get `猫垄姑⑩dcx盲赂沤忙姑ヂ姑ぢ宦р得ヂ氓⑩` or `猫垄鈥姑⑩dcx盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р得ヂ氓鈥⑩` or `猫垄鈥姑⑩dcx盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р得ヂ氓鈥⑩` (GB2312, GBK and GB18030) out from your mojibake so far. Will look further. — Martijn Pieters, Mar 16 '15 at 08:15
Unfortunately, FTFY doesn't support the GB* series of codecs; I [filed an exploratory issue](https://github.com/LuminosoInsight/python-ftfy/issues/34) with the project. — Martijn Pieters, Mar 16 '15 at 08:31
For future posts, can you include the output of `print repr(broken_value)`? That way any non-printable and non-ASCII bytes will be included as escape sequences, allowing us to accurately recreate the value. — Martijn Pieters, Mar 16 '15 at 10:22
After talking with the `ftfy` core devs it's clear that my assumption that you have a Mojibake of a GB* encoding here may not be correct; it could also be a *double* CP1252 - UTF8 Mojibake. I'd be really interested to see a `repr()` output of the string here so we have all the bytes. So far my best decode with that assumption comes to `袋dcx与朋们`. — Martijn Pieters, Mar 16 '15 at 19:41

rspeer · Answer 1 · 2015-03-16T21:04:08.027

(ftfy developer here)

I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are characters missing in the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.

Google Translate is not very helpful here but it seems to mean something about e-commerce.

So here's everything that happened to your text:

It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end

If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.

So here's my recommendation:

Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.

Martijn Pieters · Accepted Answer · 2019-01-17T12:01:29.973

You have Mojibake strings; text encoded to one (correct) codec, then decoded as another.

In this case, your text was decoded with the Windows 1252 codepage; the U+20AC EURO SIGN in the text is typical of CP1252 Mojibakes. The original encoding could be one of the GB* family of Chinese encodings, or a multiple roundtrip UTF-8 - CP1252 Mojibake. Which one I cannot determine, I cannot read Chinese, nor do I have your full data; CP1252 Mojibakes include un-printable characters like 0x81 and 0x8D bytes that might have gotten lost when you posted your question here.

I'd install the ftfy project; it won't fix GB* encodings (I requested the project add support), but it includes a new codec called sloppy-windows-1252 that'll let you reverse an erroneous decode with that codec:

>>> import ftfy  # registers extra codecs on import
>>> text = u'Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€'
>>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
袋�dcx与朋�们���

The � U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be due to the fact that your copied string here is missing anything not printable or using the 0x81 or 0x8D bytes.

You can try to fix your data this way; from the file data, try to decode to one of the GB* codecs after encoding to sloppy-windows-1252, or roundtrip from UTF-8 twice and see what fits best.

If that's not good enough (you cannot fix the data) you can use the ftfy.badness.sequence_weirdness() function to try and detect the issue:

>>> from ftfy.badness import sequence_weirdness
>>> sequence_weirdness(text)
9
>>> sequence_weirdness(u'元大寶來證券')
0
>>> sequence_weirdness(u'John Dove')
0

Mojibakes score high on the sequence weirdness scale. You'd could try and find an appropriate threshold for your data by which time you'd call the data most likely to be corrupted.

However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but you could not then encode that Chinese text to the CP-1252 codec while you can with the broken text:

from ftfy.badness import sequence_weirdness

def is_valid_unicode_str(text):
    if not sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return False

@Shashi: so presumably the encode-to-sloppy-cp1252-then-decode worked for you? — Martijn Pieters, Mar 16 '15 at 10:21

Identify garbage unicode string using python

2 Answers2