
I'm trying to decode Japanese strings in a loop that reads a file with shift-jis.

It works, but when a string contains circled-number characters like "①", I get the following error:

UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 24-25: illegal multibyte sequence

Some of the code:

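from struct import unpack

def read_short(data):
    # Big-endian signed 16-bit length prefix
    return unpack('>h', data.read(2))[0]

def read_string(data):
    # Read the length prefix, then decode that many bytes
    length = read_short(data)
    return unpack(str(length) + 's', data.read(length))[0].decode('shift-jis')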

test = read_string(data)

Is there a Japanese codec that can read this kind of character, or do I have to find a way to convert it beforehand?

dspencer
Roxxxance
  • Can you share sample input and expected output – Anshul May 25 '20 at 06:56
  • Aren't those characters decodable by Unicode codecs? – Torxed May 25 '20 at 06:58
  • Expected output would be the full string (with Japanese characters that I decode with shift-jis) including the circled digits (like ①). As for the input, I can't figure out how to print it (because I can't decode it; it's from a big binary file). BTW I'm quite new to Python (and not very good at coding, it seems). @Torxed: I don't think so, because UTF-8 should be able to decode it if it were decodable with Unicode codecs, I guess. – Roxxxance May 25 '20 at 09:42
  • Could you give us a chunk of the input data represented as a bytes string? `with open('source.txt', 'rb') as fh: print(fh.read(20))` and add it to your question. UTF-8 won't support it, but perhaps `UTF-16` does if you know the data is packed in a 2-byte format. Although I suspect that they might be packed in 3 bytes per actual character. So perhaps `UTF-32`. – Torxed May 25 '20 at 09:48
  • I isolated the part that contains the characters that I can't decode with a `pprint(unpack(str(length) + 's', data.read(length))[0])` in my loop. Here's what it returns: '\x82\xa8\x82\xb7\x82\xb5\x83N\x83b\x83V\x83\x87\x83\x93\x83V\x83\x8a\x81[\x83Y\x87@\x81@\x96\x8e%r\x82\xe0\x82\xc1\x82\xbf\x82\xe8\x83V\x83\x83\x83\x8a\x82\xc9\x82\xa8\x8dD\x82\xab\x82\xc8\x83l\x83^\x82\xc5%r\x8fW\x82\xdf\x82\xc4\x8ay\x82\xb5\x82\xe0\x82\xa4\x81I' Reading some data from the file with `with open` returns: DS_BASE2 +h▒i ▒▒g▒'▒▒▒ l▒+▒▒▒Խ▒▒s▒▒▒gg▒▒A▒C▒ – Roxxxance May 25 '20 at 10:38
  • Found the characters that python can't decode: \x87@\x81@. Now I need to find how to decode it if it is possible. – Roxxxance May 25 '20 at 19:39

2 Answers


Well, I'm a bit dumb. I solved it simply by using the cp932 codec instead.
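As a rough sketch of that change, the reader from the question could take the codec as a parameter and default to cp932. The `encoding` parameter, the `payload` bytes (a fragment taken from the dump in the comments), and the in-memory `stream` are assumptions made for illustration; the real data comes from the binary file.

```python
import io
from struct import pack, unpack


def read_short(data):
    # Big-endian signed 16-bit length prefix, as in the question.
    return unpack('>h', data.read(2))[0]


def read_string(data, encoding='cp932'):
    # Read the length prefix, then decode that many bytes.
    # cp932 is the default so that vendor extensions such as the
    # circled digits decode instead of raising UnicodeDecodeError.
    length = read_short(data)
    return data.read(length).decode(encoding)


# Hypothetical in-memory record built from bytes seen in the dump above;
# it ends with the pair b'\x87@' that shift-jis rejected.
payload = b'\x83V\x83\x8a\x81[\x83Y\x87@'
stream = io.BytesIO(pack('>h', len(payload)) + payload)

text = read_string(stream)
print(text)
```

Calling `bytes.decode` directly on the slice does the same job as `unpack(str(length) + 's', ...)[0]`, so the extra `unpack` step is dropped here.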

Roxxxance

You will have to use decode('cp932') rather than decode('shift-jis') to handle characters like ①.

Strictly speaking, characters like ① are not contained in the Shift-JIS character set (JIS X 0201 and JIS X 0208 in the Japanese Industrial Standards).

Characters like ① are contained in a Microsoft-specific character set. Many programming languages call it "MS932", "CP932" or "Windows-31J", but Microsoft calls it just "Shift-JIS" in most of its documents and applications. You will have to use cp932 to handle all characters of so-called "Shift-JIS" in most situations.

SATO Yusuke