
[Summary]: The data grabbed from the file is

b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

How can I decode these bytes into readable Chinese characters?

======

I extracted some game scripts from an exe file. The file is packed with Enigma Virtual Box and I unpacked it.

Then I'm able to see the scripts' names just fine, in English, as they're supposed to be.

When analyzing these scripts, I get an error that looks like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte

I changed the decoding to GBK, and the error disappeared.
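In code, the step in question is essentially this (simplified; the file name is a placeholder):

with open("script.dat", "rb") as f:  # placeholder name for one extracted script
    raw = f.read()

text = raw.decode("gbk")  # no longer raises, but the result is mojibake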

But the output file is not readable. It includes readable English characters and unreadable content which is supposed to be in Chinese. Example:

chT0002>pDIӘIʆ

I tried different encodings for saving the file and they all show the same result, so the problem is probably in the decoding step.

The data grabbed from the file is

b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

I have tried many approaches, but I just can't decode these bytes into readable Chinese characters. Is something wrong with the file itself, or somewhere else? I really need help, please.

One of the scripts is attached here.

– Naojiang Hu

1 Answer


In order to reliably decode bytes, you must know how the bytes were encoded. To borrow a quote from the Python codecs docs:

Without external information it’s impossible to reliably determine which encoding was used for encoding a string.
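To illustrate the point (an example I'm adding here, not from the docs): the same bytes can decode cleanly under several codecs and produce entirely different text, so the bytes alone can't tell you which codec is right.

# Russian "привет" ("hello") encoded with cp1251...
data = "привет".encode("cp1251")

# ...decodes without error, but to different text, under another single-byte codec:
print(data.decode("cp1251"))   # привет  (the intended text)
print(data.decode("latin_1"))  # ïðèâåò  (also "valid", but meaningless)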

Without this information, you can still try to detect the encoding (chardet seems to be the most widely used library for this). Here's how you could approach that:

import chardet

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
# Returns a dict like {'encoding': ..., 'confidence': ..., 'language': ...}
detected = chardet.detect(data)
if detected["encoding"] is not None:  # chardet reports None when it can't tell
    decoded = data.decode(detected["encoding"])

The above example, however, does not work in this case because chardet isn't able to detect the encoding of these bytes (it returns None). At that point, you'll have to either use trial and error or try other libraries.
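For example, charset-normalizer is a common alternative to chardet. A minimal sketch (assuming pip install charset-normalizer; it may be just as stumped by a sample this short):

from charset_normalizer import from_bytes

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
best = from_bytes(data).best()  # best() returns None if nothing plausible was found
if best is not None:
    print(best.encoding, str(best))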

Failing that, one method you could use is to simply try every standard encoding, print out each result, and see which one makes sense:

codecs = [
    "ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720", 
    "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860",
    "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875",
    "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250",
    "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257",
    "cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312",
    "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
    "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1",
    "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7",
    "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14",
    "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048",
    "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman",
    "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213",
    "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7",
    "utf_8", "utf_8_sig",
]

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

# Try every codec; print the ones that decode without error.
for codec in codecs:
    try:
        print(f"{codec}, {data.decode(codec)}")
    except UnicodeDecodeError:
        # These bytes aren't valid in this encoding; move on.
        continue

Output:

cp037, nC«^ýËfimb«[
cp273, nC«¢ýËfimb«¬
cp437, ò├è░ìsåëöéè║
cp500, nC«¢ýËfimb«¬
cp720, ـ├è░së¤éè║
cp737, Χ├Λ░ΞsΗΚΦΓΛ║
cp775, Ģ├Ŗ░ŹsåēöéŖ║
cp850, ò├è░ìsåëöéè║
cp852, Ľ├Ő░ŹsćëöéŐ║
cp855, Ћ├і░ЇsєЅћѓі║
cp856, ץ├ך░םsזיפגך║
cp857, ò├è░ısåëöéè║
cp858, ò├è░ìsåëöéè║
cp860, ò├è░ìsÁÊõéè║
cp861, þ├è░Þsåëöéè║
cp862, ץ├ך░םsזיפגך║
cp863, Ï├è░‗s¶ëËéè║
cp864, ¼ﺃ├٠┌s│┬½∙├ﻑ
cp865, ò├è░ìsåëöéè║
cp866, Х├К░НsЖЙФВК║
cp875, nCα£δΉfimbας
cp949, 빩뒺뛱냹봻듆
cp1006, ﺣﺍsﭦ
cp1026, nC«¢`Ëfimb«¬
cp1125, Х├К░НsЖЙФВК║
cp1140, nC«^ýËfimb«[
cp1250, •ĂŠ°Ťs†‰”‚Šş
cp1251, •ГЉ°Ќs†‰”‚Љє
cp1256, •أٹ°چs†‰”‚ٹ؛
gbk, 暶姲峴唹攤姾
gb18030, 暶姲峴唹攤姾
latin_1, ðsº
iso8859_2, ðsş
iso8859_4, ðsē
iso8859_5, УАsК
iso8859_7, Γ°sΊ
iso8859_9, ðsº
iso8859_10, ðsš
iso8859_11, รฐsบ
iso8859_13, ưsŗ
iso8859_14, ÃḞsẃ
iso8859_15, ðsº
iso8859_16, ðsș
koi8_r, ∙ц┼╟█s├┴■┌┼╨
koi8_u, ∙ц┼╟█s├┴■┌┼╨
kz1048, •ГЉ°Қs†‰”‚Љғ
mac_cyrillic, Х√К∞НsЖЙФВКЇ
mac_greek, ïΟäΑçsÜâî²äΚ
mac_iceland, ï√ä∞çsÜâîÇä∫
mac_latin2, ē√äįćsÜČĒāäļ
mac_roman, ï√ä∞çsÜâîÇä∫
mac_turkish, ï√ä∞çsÜâîÇä∫
ptcp154, •ГҠ°ҚsҶү”ӮҠә
shift_jis_2004, 陛寛行̹狽桓
shift_jisx0213, 陛寛行̹狽桓
utf_16, 쎕낊玍覆芔몊
utf_16_be, 闃誰赳蚉钂誺
utf_16_le, 쎕낊玍覆芔몊
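Since the target text is supposed to be Chinese, you can also narrow the list automatically by keeping only decodings that consist mostly of CJK ideographs. A rough sketch (the cjk_ratio helper and the 0.8 cutoff are my own arbitrary choices), reusing data and codecs from above:

def cjk_ratio(text: str) -> float:
    # Fraction of characters in the CJK Unified Ideographs block (U+4E00..U+9FFF).
    if not text:
        return 0.0
    return sum("\u4e00" <= ch <= "\u9fff" for ch in text) / len(text)

for codec in codecs:
    try:
        decoded = data.decode(codec)
    except UnicodeDecodeError:
        continue
    if cjk_ratio(decoded) > 0.8:  # arbitrary cutoff; drops the Hangul and Latin mojibake
        print(f"{codec}: {decoded}")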

Edit: After running all of the seemingly legible results through Google Translate, I suspect the encoding is UTF-16 big-endian. Here are the results:

Encoding    Decoded         Language Detected   English Translation
gbk         暶姲峴唹攤姾      Chinese             Jian Xian JiaoTanJiao
gb18030     暶姲峴唹攤姾      Chinese             Jian Xian Jiao Tan Jiao
utf_16      쎕낊玍覆芔몊      Korean              None
utf_16_be   闃誰赳蚉钂誺      Chinese             Who is the epiphysis?
utf_16_le   쎕낊玍覆芔몊      Korean              None
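You can check that reading directly; UTF-16 BE consumes each byte pair as one code unit, which is consistent with these 12 bytes producing exactly 6 characters:

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
print(data.decode("utf_16_be"))  # 闃誰赳蚉钂誺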
– Sam Morgan
  • Thanks for helping, this is a really good way to test encodings. Unfortunately, none of these results are readable Chinese, but it clearly shows that the problem is probably with the file itself. I'll go and try another unpacking tool. – Naojiang Hu Jul 30 '20 at 18:31