get utf-8 character code given a shift-jis character code?

Question

In my program I get Shift-JIS character codes as Python integers, which I need to convert to their corresponding UTF-8 character codes (also as integers). How can I do that? For ASCII there are the helpful functions ord()/chr(), which let you convert an integer into an ASCII string that you can easily convert to Unicode later. I can't find anything like that for other encodings.

Using Python 2.

EDIT: the final code. Thanks everyone:

def shift_jis2unicode(charcode): # charcode is an integer
    if charcode <= 0xFF:
        string = chr(charcode)
    else:
        string = chr(charcode >> 8) + chr(charcode & 0xFF)

    return ord(string.decode('shift-jis'))

print shift_jis2unicode(0x8140)  # hex literal; decimal 8140 splits into two single-byte characters and ord() fails

It's unusual to get them as integers rather than as bytes - is that something you can change? — Thomas K, Feb 24 '12 at 18:14
Sorry, I can't. BTW, I think "bytes" is something new in Python 3, I use 2. — Alex C, Feb 24 '12 at 18:17
Python 2 `str` works like bytes, and it has a `bytes` alias in 2.6 and 2.7. — Thomas K, Feb 24 '12 at 18:20
Well, I wish I could. That's why I posted this question. If I could get it as string, I could just do mystr.decode('shift_jis') and then ord() on that. But I can't. — Alex C, Feb 24 '12 at 18:24
Show some sample data to give us a better idea of what you're working with. — Ignacio Vazquez-Abrams, Feb 24 '12 at 18:31
`ord()` would give you unicode code points, not utf-8. That may be what you want, but those are very different things. — Thomas K, Feb 24 '12 at 18:37
Sorry, but I think I explained exactly what I have and what I need to do with it. — Alex C, Feb 24 '12 at 18:39
Thomas K: I guess you're right. Still, I haven't even reached that point: I don't even know how to get the integer character code into a string of shift-jis encoding. — Alex C, Feb 24 '12 at 18:42
I thought it was "str" which was reserved. Anyway, not even in a function? — Alex C, Feb 24 '12 at 20:20
"str" is a built-in type. "string" is a built-in module. Technically you can use them for variable names, but it's confusing. Better to avoid them. — user9876, Feb 24 '12 at 20:23

score 2 · Accepted Answer · answered Feb 24 '12 at 20:20

There's no such thing as "utf8 character codes (which should also be in integers)".

Unicode defines "code points", which are integers. UTF-8 defines how to convert those code points to an array of bytes.

So I think you want the Unicode code points. In that case:

def shift_jis2unicode(charcode): # charcode is an integer
    if charcode <= 0xFF:
        shift_jis_string = chr(charcode)
    else:
        shift_jis_string = chr(charcode >> 8) + chr(charcode & 0xFF)

    unicode_string = shift_jis_string.decode('shift-jis')

    assert len(unicode_string) == 1
    return ord(unicode_string)

print "U+%04X" % shift_jis2unicode(0x8144)
print "U+%04X" % shift_jis2unicode(0x51)

(Also: I don't think 8100 is a valid shift-JIS character code...)
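
To make the distinction between a code point and its UTF-8 bytes concrete, here is a minimal sketch. The helper name `shift_jis_to_text` is hypothetical, and the block uses `bytearray` instead of `chr()` so that it should run on both Python 2 and Python 3; it is an illustration of the idea above, not a replacement for the answer's function.

```python
def shift_jis_to_text(charcode):
    """Decode an integer Shift-JIS code (one or two bytes) to a Unicode string."""
    if charcode <= 0xFF:
        raw = bytearray([charcode])                         # single-byte code
    else:
        raw = bytearray([charcode >> 8, charcode & 0xFF])   # lead byte, trail byte
    return bytes(raw).decode('shift_jis')

text = shift_jis_to_text(0x8144)

# The code point is ONE integer identifying the character.
code_point = ord(text)

# The UTF-8 encoding is a SEQUENCE of bytes (here, three of them).
utf8_bytes = list(bytearray(text.encode('utf-8')))

print("U+%04X" % code_point)          # U+FF0E
print([hex(b) for b in utf8_bytes])   # ['0xef', '0xbc', '0x8e']
```

So "the UTF-8 code as an integer" is ambiguous: either you want the single code point (0xFF0E here) or the list of encoded bytes.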

answered Feb 24 '12 at 20:20

user9876

8100 was kind of a guess and a wrong one. Don't get the whole unicode vs utf-8 business. I think you are right though. – Alex C Feb 24 '12 at 20:27
@AlexC, Unicode strings are made up of codepoints (generally one per character) and `ord` will convert a codepoint to an integer. UTF-8 is a representation of a codepoint in 1 or more 8-bit bytes. – Mark Ransom Feb 24 '12 at 21:18
For a good intro to Unicode and all the encoding issues, I recommend "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" - http://www.joelonsoftware.com/articles/Unicode.html – user9876 Mar 01 '12 at 22:27

score 1 · Answer 2 · answered Feb 24 '12 at 18:50

There may be a better way to do this, but since there are no other answers yet here is an option.

You could use this table to convert your shift-jis integers to unicode code points, then use unichr() to convert your data into a Python unicode object, and then convert it from unicode to utf8 using unicode.encode('utf-8').
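
A rough sketch of that table-driven route follows. The dictionary `SJIS_TO_CODE_POINT` and the function name are hypothetical, and the table holds only a two-entry excerpt for illustration; a real one would need the full mapping. The `unichr` shim is there only so the block should also run on Python 3.

```python
try:
    unichr            # Python 2 built-in
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character

# Hypothetical excerpt of a Shift-JIS -> Unicode code point table.
SJIS_TO_CODE_POINT = {
    0x8140: 0x3000,   # IDEOGRAPHIC SPACE
    0x8144: 0xFF0E,   # FULLWIDTH FULL STOP
}

def shift_jis_to_utf8_bytes(charcode):
    """Look up the code point, then encode that character as UTF-8 bytes."""
    code_point = SJIS_TO_CODE_POINT[charcode]      # KeyError if not in the table
    encoded = unichr(code_point).encode('utf-8')
    return list(bytearray(encoded))                # one integer per byte

print(shift_jis_to_utf8_bytes(0x8140))   # [227, 128, 128]
```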

answered Feb 24 '12 at 18:50

Andrew Clark

Thanks. I'm already using a custom table. I thought if I could use what Python provides, the code would be cleaner and I wouldn't need to have an extra file holding all the character codes. – Alex C Feb 24 '12 at 18:55

score 0 · Answer 3 · answered Feb 24 '12 at 19:15

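def from_shift_jis(seq):
    # Turn each integer code into one byte (<= 0xff) or two bytes (lead, trail).
    chars = [chr(c) if c <= 0xff else chr(c >> 8) + chr(c & 0xff) for c in seq]
    return ''.join(chars).decode('shift-jis')

shift_jis_input = [0x8140, 0x51]  # example input: a sequence of integer Shift-JIS codes
# One integer per UTF-8 byte of the decoded text.
utf8_output = [ord(c) for c in from_shift_jis(shift_jis_input).encode('utf-8')]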

edited Feb 24 '12 at 19:28

answered Feb 24 '12 at 19:15

Mark Ransom

What does "chr(c>>8) + chr(c&0xff)" do? – Alex C Feb 24 '12 at 19:46
@AlexC, `c>>8` shifts the upper 8 bits of the integer into the lower 8 bits, and `c&0xff` strips off the upper 8 bits. It's a way of splitting an integer into two 8-bit parts. The `chr` converts to a character as you know, and `+` appends them into a two-character string. – Mark Ransom Feb 24 '12 at 20:01
OK. I'm having trouble now actually converting the unicode string to a utf-8 character code integer. I'll update my question with the code I have so far, please have a look. – Alex C Feb 24 '12 at 20:06
@AlexC, I think you want `0x8100` rather than `8100` in your test code. – Mark Ransom Feb 24 '12 at 20:16
I think 0xFF and 255 are the exact same thing in Python. Still an error anyway. – Alex C Feb 24 '12 at 20:17
@AlexC, yes 0xFF and 255 are the same but that's not what I was talking about. `0x8100` is a valid shift-jis character but `8100` is not. – Mark Ransom Feb 24 '12 at 20:20
It is? shift-jis codec fails to decode it. Anyway, the real problem is ord() accepts a char (string of length 1), while we pass a two byte string to it. Got to think of something else... – Alex C Feb 24 '12 at 20:24
@AlexC, did you try running the exact code I gave you? It should work fine, returning a list of ints each of which is a utf-8 byte. – Mark Ransom Feb 24 '12 at 20:27
@AlexC, sorry you're correct - 0x8100 isn't valid shift-jis, it starts at 0x8140. – Mark Ransom Feb 24 '12 at 20:29
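
The two points from the comments above (how the shift and mask split an integer, and why `8100`, `0x8100` and `0x8140` behave differently) can be sketched as follows. The helper names `split_code` and `is_decodable` are hypothetical, `is_decodable` is only meant for two-byte codes, and the result depends on Python's `shift_jis` codec.

```python
def split_code(c):
    """Split a two-byte integer code into (lead byte, trail byte)."""
    return c >> 8, c & 0xFF

def is_decodable(c):
    """Return True if the two-byte code decodes with Python's shift_jis codec."""
    try:
        bytes(bytearray(split_code(c))).decode('shift_jis')
        return True
    except UnicodeDecodeError:
        return False

print([hex(b) for b in split_code(0x8144)])   # ['0x81', '0x44']

# Decimal 8100 is 0x1FA4, so it splits into completely different bytes.
print([hex(b) for b in split_code(8100)])     # ['0x1f', '0xa4']

print(is_decodable(0x8100))   # False: 0x00 is not a valid trail byte
print(is_decodable(0x8140))   # True
```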
