
When you have incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

I know for a fact that this image filename should have been some Japanese characters. But despite various guesses at urllib quoting/unquoting and encoding/decoding with iso8859-1 and utf8, I haven't been able to unmunge it and recover the original filename.

Is the corruption reversible?

wim

1 Answer


You could use chardet (install with pip):

import chardet

# Python 2; per the comments below, this only works if the source file itself
# is saved as cp850, so that your_str holds the original (cp932) bytes.
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]  # chardet guesses the real encoding

try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")

Result: 時間試験観点(アニメパス)_10秒 (no idea if this could be correct or not)

For Python 3 (source file encoded as utf8):

import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

try:
    # Undo the wrong decode: re-encode back to the original bytes (cp850 here)
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    # Let chardet guess the real encoding of the recovered bytes
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except UnicodeDecodeError:
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)

In summary:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'
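
As a quick sanity check, continuing the same session: applying the two encodings in the opposite order reproduces the original mojibake, so the round trip appears lossless for this particular string (a sketch; it assumes the cp850/shift-jis pair identified above):

>>> recovered = s.encode('cp850').decode('shift-jis')
>>> recovered.encode('shift-jis').decode('cp850') == s
True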
galinden
    Google Translate says "Test time point of view (animation path) _10 seconds", looks almost sensible! – Matteo Italia Jun 10 '14 at 12:52
  • This looks very cool and promising, but I am getting a different result from you. Did you run the code just as shown here? Which Python and chardet version numbers? – wim Jun 10 '14 at 22:15
  • I ran this on Python 2.7.7 + chardet 2.2.1. With Python 3 this definitely does not work; I'll have a look at how to do that. – galinden Jun 11 '14 at 08:25
  • As the above includes a byte string with non-ASCII characters, what happens depends on the encoding you save the source file in. Since the string `Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb` is the result of a code page 932-encoded (Shift-JIS-like) string being misinterpreted as code page 850 (DOS Western European), the source above would have to be saved as cp850 to work. – bobince Jun 11 '14 at 09:08
  • I have added a solution for Python 3 above, thanks to @bobince's comment – galinden Jun 11 '14 at 09:35
  • The Python 3 version works for me, but @bobince how did you know those encodings? That is the guts of this question, and in this current answer the important part ("cp850") is hardcoded! – wim Jun 11 '14 at 10:05
  • With this type of encoding problem the wrong encoding is usually constant, so it is safe to assume cp850 in situations where the Python 2 implementation would work. It is the actual encoding (SHIFT_JIS) that you have to estimate. There is no guaranteed way to determine an encoding, especially not when a false decoding has already been applied. I think the above should help in most common cases; otherwise I look forward to more examples. – galinden Jun 11 '14 at 11:26
  • The selection of accented characters tells me you have a Western European encoding and the presence of the box drawing character `▒` implies a DOS encoding, so put those together and you probably have 850 (or conceivably 437). "Japanese" implies that the encoding that should have been used instead is either a UTF, or one of the Japanese-specific MBCSs. As what you have is probably 850, you are probably using Windows, so likely it's the old Windows default code page for Japanese-locale installations, 932. – bobince Jun 11 '14 at 11:31
  • This was my heuristic approach based on previous experience; clearly it doesn't translate practically to automated detection. What you would probably do if trying to solve this programmatically would be to have a rating function for "looks most like the target language" and try encoding and re-decoding using all combinations of the encodings Python knows about in turn (see the sketch after these comments). The unmangled text may not always be recoverable: there are many encodings in which not every byte maps to a character, so characters that used those bytes in the real encoding would be lost. – bobince Jun 11 '14 at 11:35
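
Below is a minimal sketch of the brute-force approach bobince describes in the last comment, assuming Python 3. The candidate encoding lists and the "looks Japanese" scoring heuristic (counting kana/CJK characters) are illustrative assumptions, not part of the original answer:

# Sketch of the brute-force approach described in the comment above (Python 3).
# The candidate lists and the "looks Japanese" score are illustrative choices.
mojibake = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"

wrong_candidates = ["cp850", "cp437", "cp1252", "latin-1"]    # how the bytes were misread
right_candidates = ["cp932", "shift-jis", "euc-jp", "utf-8"]  # what they might really be

def japanese_score(text):
    """Fraction of characters in the kana/CJK Unicode ranges."""
    def is_jp(ch):
        code = ord(ch)
        return 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF
    return sum(is_jp(ch) for ch in text) / max(len(text), 1)

results = []
for wrong in wrong_candidates:
    try:
        raw = mojibake.encode(wrong)            # undo the presumed wrong decode
    except UnicodeEncodeError:
        continue
    for right in right_candidates:
        try:
            candidate = raw.decode(right)       # try a presumed real encoding
        except UnicodeDecodeError:
            continue
        results.append((japanese_score(candidate), wrong, right, candidate))

for score, wrong, right, text in sorted(results, reverse=True)[:3]:
    print("%.2f  %s -> %s  %s" % (score, wrong, right, text))

With this input, the highest-scoring pairs should be cp850 combined with cp932 or shift-jis, matching the answer above.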