To reliably decode bytes, you must know how the bytes were encoded. To borrow a quote from the Python codecs docs:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string.
Without that external information, there are still ways to try to detect the encoding (chardet seems to be the most widely used library for this). Here's how you could approach that:
import chardet

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
detected = chardet.detect(data)  # dict with at least "encoding" and "confidence" keys
decoded = data.decode(detected["encoding"])
The above example, however, does not work in this case, because chardet isn't able to detect the encoding of these bytes. At that point, you'll have to fall back on trial and error or try other libraries.
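Either way, it's worth guarding the chardet call before trusting it, since chardet.detect returns a dict whose "encoding" can be None (which is what happens here) and whose "confidence" can be too low to rely on. A minimal sketch (the 0.5 threshold is an arbitrary choice of mine):
import chardet

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
detected = chardet.detect(data)

# "encoding" is None when detection fails; 0.5 is an arbitrary cut-off.
if detected["encoding"] is None or detected["confidence"] < 0.5:
    print("Could not reliably detect the encoding:", detected)
else:
    print(data.decode(detected["encoding"]))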
One method you could use is to simply try every standard encoding, print out the result, and see which encoding makes sense.
codecs = [
"ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720",
"cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860",
"cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875",
"cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250",
"cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257",
"cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312",
"gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
"iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1",
"iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7",
"iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14",
"iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048",
"mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman",
"mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213",
"utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7",
"utf_8", "utf_8_sig",
]
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
for codec in codecs:
    try:
        print(f"{codec}, {data.decode(codec)}")
    except UnicodeDecodeError:
        continue
Output
cp037, nC«^ýËfimb«[
cp273, nC«¢ýËfimb«¬
cp437, ò├è░ìsåëöéè║
cp500, nC«¢ýËfimb«¬
cp720, ـ├è░së¤éè║
cp737, Χ├Λ░ΞsΗΚΦΓΛ║
cp775, Ģ├Ŗ░ŹsåēöéŖ║
cp850, ò├è░ìsåëöéè║
cp852, Ľ├Ő░ŹsćëöéŐ║
cp855, Ћ├і░ЇsєЅћѓі║
cp856, ץ├ך░םsזיפגך║
cp857, ò├è░ısåëöéè║
cp858, ò├è░ìsåëöéè║
cp860, ò├è░ìsÁÊõéè║
cp861, þ├è░Þsåëöéè║
cp862, ץ├ך░םsזיפגך║
cp863, Ï├è░‗s¶ëËéè║
cp864, ¼ﺃ├٠┌s│┬½∙├ﻑ
cp865, ò├è░ìsåëöéè║
cp866, Х├К░НsЖЙФВК║
cp875, nCα£δΉfimbας
cp949, 빩뒺뛱냹봻듆
cp1006, ﺣﺍsﭦ
cp1026, nC«¢`Ëfimb«¬
cp1125, Х├К░НsЖЙФВК║
cp1140, nC«^ýËfimb«[
cp1250, •ĂŠ°Ťs†‰”‚Šş
cp1251, •ГЉ°Ќs†‰”‚Љє
cp1256, •أٹ°چs†‰”‚ٹ؛
gbk, 暶姲峴唹攤姾
gb18030, 暶姲峴唹攤姾
latin_1, ðsº
iso8859_2, ðsş
iso8859_4, ðsē
iso8859_5, УАsК
iso8859_7, Γ°sΊ
iso8859_9, ðsº
iso8859_10, ðsš
iso8859_11, รฐsบ
iso8859_13, ưsŗ
iso8859_14, ÃḞsẃ
iso8859_15, ðsº
iso8859_16, ðsș
koi8_r, ∙ц┼╟█s├┴■┌┼╨
koi8_u, ∙ц┼╟█s├┴■┌┼╨
kz1048, •ГЉ°Қs†‰”‚Љғ
mac_cyrillic, Х√К∞НsЖЙФВКЇ
mac_greek, ïΟäΑçsÜâî²äΚ
mac_iceland, ï√ä∞çsÜâîÇä∫
mac_latin2, ē√äįćsÜČĒāäļ
mac_roman, ï√ä∞çsÜâîÇä∫
mac_turkish, ï√ä∞çsÜâîÇä∫
ptcp154, •ГҠ°ҚsҶү”ӮҠә
shift_jis_2004, 陛寛行̹狽桓
shift_jisx0213, 陛寛行̹狽桓
utf_16, 쎕낊玍覆芔몊
utf_16_be, 闃誰赳蚉钂誺
utf_16_le, 쎕낊玍覆芔몊
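As a side note, if you'd rather not hard-code that list, one option (a sketch, assuming Python 3) is to build the candidate codecs from the standard library's encodings.aliases table and skip anything that isn't a text encoding:
from encodings.aliases import aliases

data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

# aliases maps alias names to canonical codec names, e.g. "u8" -> "utf_8".
for codec in sorted(set(aliases.values())):
    try:
        print(f"{codec}, {data.decode(codec)}")
    except (UnicodeDecodeError, LookupError):
        # LookupError covers bytes-to-bytes codecs such as base64_codec,
        # which aren't text encodings and can't be used with bytes.decode().
        continue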
Edit: After running all of the seemingly legible results through Google Translate, I suspect this encoding is UTF-16 big-endian. Here are the results:
| Encoding | Decoded | Language Detected | English Translation |
| --- | --- | --- | --- |
| gbk | 暶姲峴唹攤姾 | Chinese | Jian Xian JiaoTanJiao |
| gb18030 | 暶姲峴唹攤姾 | Chinese | Jian Xian Jiao Tan Jiao |
| utf_16 | 쎕낊玍覆芔몊 | Korean | None |
| utf_16_be | 闃誰赳蚉钂誺 | Chinese | Who is the epiphysis? |
| utf_16_le | 쎕낊玍覆芔몊 | Korean | None |
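If UTF-16 big-endian really is the right guess, decoding with it directly should reproduce that fourth row:
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
print(data.decode("utf_16_be"))  # 闃誰赳蚉钂誺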