Python: Determining if I have a 16-bit encoded string

Question

I have a UTF-16-BE encoded string:

utf16be = '\x0623\x0631\x0646\x0628'

print repr(utf16be)
> '\x0623\x0631\x0646\x0628'

I need to know if it's a 1-byte or 2-byte encoding. I tried the snippet below:

for c in utf16be:
    c_ord = ord(c)
    if c_ord >= 256:
        print 'Its a 2-byte (or more) encoded string'
        break

But that won't work, because I thought utf16be[0] would be equal to '\x0623', but it's actually equal to '\x06':

for c in utf16be:
    print repr(c)

> '\x06'
> '2'
> '3'
> '\x06'
> '3'
> '1'
> '\x06'
> '4'
> '6'
> '\x06'
> '2'
> '8'

So what is the best practice to check whether I have a 2-byte encoded string?

You cannot; you need to know in advance. Python parses "\x0034" as "\x00" + "3" + "4", so how would you know that's not what was meant? – Joran Beasley, Dec 05 '13 at 23:37
Ideally, you'd know because something, somewhere told you the encoding. Otherwise, the best you can do is try to decode it with various encodings and see if they work. – user2357112, Dec 05 '13 at 23:37
I wrote a utf16be string for the sake of simplicity, but in the real environment I would get any string as input and I must know its byte encoding (1, 2 or more bytes per char). – zfou, Dec 05 '13 at 23:41
"I have written an utf16be string" - no, you haven't. You tried, but what you have is something else entirely. If you want an example string encoded the way you want, `encode` a Unicode string. – user2357112, Dec 05 '13 at 23:43
\x0623 is a UTF-16 character I picked from a UTF-16 table. Why should I write it in Unicode and then encode it to UTF-16 when I can write it directly in the latter? – zfou, Dec 05 '13 at 23:52
So you can see what a UTF-16 encoded string looks like in Python, maybe? – Joran Beasley, Dec 05 '13 at 23:54

score 1 · Answer 1 · answered Dec 05 '13 at 23:42

A UTF-16-BE encoded string necessarily has two bytes per code unit (hence the 16 in the name). UTF-8 uses one-byte code units, but UTF-16 does not.
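To make the code-unit point concrete, here is a minimal Python 3 sketch (the question uses Python 2, so the syntax differs slightly). The sample strings are illustrative assumptions, not data from the question, apart from the four Arabic code points.

```python
# Python 3 sketch; the sample strings are illustrative assumptions.
samples = {
    "ascii": "A",                          # U+0041
    "arabic": "\u0623\u0631\u0646\u0628",  # the four code points from the question
    "emoji": "\U0001F600",                 # U+1F600, outside the Basic Multilingual Plane
}

for name, text in samples.items():
    utf16 = text.encode("utf-16-be")
    utf8 = text.encode("utf-8")
    print(name, len(text), "code point(s):",
          len(utf16), "UTF-16 bytes,", len(utf8), "UTF-8 bytes")

arabic_utf16 = samples["arabic"].encode("utf-16-be")
print(arabic_utf16)  # b'\x06#\x061\x06F\x06('
```

Every UTF-16 length is a multiple of two, but the character outside the Basic Multilingual Plane takes four bytes (a surrogate pair), so "two bytes per code unit" is not the same as "two bytes per character".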

Your comment suggests you're getting a string and need to figure out whether it uses one, two, or more bytes per character, but that doesn't make sense. You need to know the encoding of the string to make sense of it; otherwise it's guesswork.
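A short Python 3 sketch of why this is guesswork: the same eight bytes decode without error under several encodings, each giving a different character count. The candidate list is an arbitrary assumption chosen for illustration.

```python
# The UTF-16-BE bytes for U+0623 U+0631 U+0646 U+0628.
data = b"\x06\x23\x06\x31\x06\x46\x06\x28"

# Hypothetical candidate list; any real input could be in some other encoding.
candidates = ["utf-16-be", "utf-16-le", "latin-1", "utf-8"]

decoded = {}
for encoding in candidates:
    try:
        decoded[encoding] = data.decode(encoding)
    except UnicodeDecodeError:
        decoded[encoding] = None  # these bytes are not valid in this encoding

for encoding, text in decoded.items():
    status = "failed" if text is None else "%d characters" % len(text)
    print(encoding, "->", status)
```

All four decodes succeed here, so a successful decode alone cannot tell you how many bytes per character the sender intended.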

Simeon Visser

repr(utf16be) is returning what I'm looking for: it knows that it's a 2-byte encoding, since it returns \x0623 and not \x06\x23. So how is that done? – zfou Dec 05 '13 at 23:46

It does not; it's returning `"\x06" + "\x32" + "\x33"` (or `"\x06" + "2" + "3"`). Additionally, if you don't know the encoding, you can't say that isn't the correct interpretation (you do know in this case __because you know the encoding__). A string that is "\x06\x23" would actually be the 16 bits; "\x0623" is 24 bits. – Joran Beasley Dec 05 '13 at 23:49
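The escape parsing described in this comment can be checked directly. This is a Python 3 sketch; in Python 2 the same `\x` rule applies to byte strings, and the code point form would be written `u'\u0623'`.

```python
# \x consumes exactly two hex digits, so the trailing "23" is ordinary text.
literal = "\x0623"
print(len(literal))                    # 3
print([hex(ord(c)) for c in literal])  # ['0x6', '0x32', '0x33']

# \u consumes four hex digits and yields a single code point.
code_point = "\u0623"
print(len(code_point))                 # 1
print(code_point.encode("utf-16-be"))  # b'\x06#'
```

The last line prints `b'\x06#'` because 0x23 is the ASCII code for `#`, so the two encoded bytes are 0x06 and 0x23.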

score 0 · Answer 2 · answered Dec 05 '13 at 23:41

Use the chardet package to guess the encoding.
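A minimal Python 3 sketch of what that could look like, assuming the third-party chardet package is installed (`pip install chardet`). The helper name `guess_encoding` and the sample text are hypothetical; `chardet.detect` returns a dictionary that includes an `encoding` guess and a `confidence` score, and it is only a guess.

```python
def guess_encoding(data):
    """Return chardet's guess as a dict, or None if chardet is not installed.

    Hypothetical helper for illustration; the result is a heuristic guess.
    """
    try:
        import chardet  # third-party package, not part of the standard library
    except ImportError:
        return None
    return chardet.detect(data)


# The "utf-16" codec prepends a byte order mark, which gives a detector a hint.
sample = "\u0623\u0631\u0646\u0628".encode("utf-16")
print(guess_encoding(sample))
```

Short inputs without a byte order mark give a detector very little to work with, so treat a low confidence value as "unknown".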

anijhaw

I'm not trying to detect the encoding; I need to know whether it's a 1-, 2-, or more-bytes-per-char encoding, regardless of whether it's UTF-16-BE or KANJI_JIS ... – zfou Dec 05 '13 at 23:44

All you can do is guess. – Joran Beasley Dec 05 '13 at 23:45
