
I have a UTF-16-BE encoded string:

utf16be = '\x0623\x0631\x0646\x0628'

print repr(utf16be)
> '\x0623\x0631\x0646\x0628'

I need to know whether it uses a 1-byte or a 2-byte encoding. I tried the snippet below:

for c in utf16be:
    c_ord = ord(c)
    if c_ord >= 256:
        print "It's a 2-byte (or more) encoded string"
        break

But that won't work: I expected utf16be[0] to be equal to '\x0623', but it's actually equal to '\x06':

for c in utf16be:
    print repr(c)

> '\x06'
> '2'
> '3'
> '\x06'
> '3'
> '1'
> '\x06'
> '4'
> '6'
> '\x06'
> '2'
> '8'

So what is the best practice for checking whether I have a 2-byte encoded string?

zfou
  • You cannot; you need to know the encoding in advance. Python treats "\x0034" as "\x00" + "3" + "4", so how would you know that is not what was meant? – Joran Beasley Dec 05 '13 at 23:37
  • Ideally, you'd know because something, somewhere told you the encoding. Otherwise, the best you can do is try to decode it with various encodings and see if they work. – user2357112 Dec 05 '13 at 23:37
  • I wrote a UTF-16-BE string for the sake of simplicity, but in the real environment I would get any string as input, and I must determine its byte encoding (1, 2 or more bytes per character). – zfou Dec 05 '13 at 23:41
  • "I have written an utf16be string" - no, you haven't. You tried, but what you have is something else entirely. If you want an example string encoded the way you want, `encode` a Unicode string. – user2357112 Dec 05 '13 at 23:43
  • \x0623 is a UTF-16 character I picked from a UTF-16 table. Why should I write it in Unicode and then encode it to UTF-16 when I can write it directly in the latter? – zfou Dec 05 '13 at 23:52
  • so you can see what a utf16 encoded string looks like in python maybe? – Joran Beasley Dec 05 '13 at 23:54
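
To illustrate the suggestion in the comments above about trying several encodings, here is a minimal sketch in Python 3 syntax (the question itself uses Python 2). The helper name `plausible_encodings` and the candidate list are assumptions made for illustration, not part of any library. It only reports which codecs decode the bytes without raising an error, which is not the same as identifying the right one:

```python
def plausible_encodings(data, candidates=('ascii', 'utf-8', 'utf-16-be', 'utf-16-le')):
    """Return the candidate codecs that decode `data` without raising an error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the bytes, so it cannot be the encoding.
            continue
        ok.append(name)
    return ok

# Build a real UTF-16-BE byte string by encoding a Unicode string.
sample = u'\u0623\u0631\u0646\u0628'.encode('utf-16-be')

print(plausible_encodings(sample))
```

For this sample every candidate decodes cleanly, because each of the eight bytes happens to be below 0x80. A successful decode therefore rules encodings out at best; it does not tell you which one was intended.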

2 Answers


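A UTF-16-BE encoded string always uses two bytes per code unit (hence the 16 in the name); characters outside the Basic Multilingual Plane take two code units, i.e. four bytes. UTF-8 uses one-byte code units, so a character can take a single byte; in UTF-16 it never does.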

Your comment suggests you're receiving a string and need to figure out whether it uses one, two or more bytes per character, but that doesn't make sense on its own. You need to know the encoding of the string to interpret it; otherwise it's guesswork.
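
As a hedged illustration of the two points above, the following sketch (Python 3 syntax; the variable names are made up for this example) builds a genuine UTF-16-BE byte string by encoding Unicode text and compares it with the literal from the question:

```python
# Four Arabic letters written as Unicode code points.
text = u'\u0623\u0631\u0646\u0628'

# Encoding produces the actual UTF-16-BE bytes: two per code unit.
encoded = text.encode('utf-16-be')
print(len(text))      # 4 characters
print(len(encoded))   # 8 bytes

# The literal from the question: '\x' consumes exactly two hex digits,
# so each group is the byte 0x06 followed by two ordinary digit characters.
literal = b'\x0623\x0631\x0646\x0628'
print(len(literal))   # 12 bytes

# The same text has a different byte length in UTF-8 ...
print(len(text.encode('utf-8')))             # 8 bytes (2 per letter here)
print(len(u'A'.encode('utf-8')))             # 1 byte
# ... and a character outside the BMP needs two UTF-16 code units.
print(len(u'\U0001F600'.encode('utf-16-be')))  # 4 bytes

# Without knowing the encoding, the same bytes can be read another way:
# Latin-1 accepts them as eight one-byte characters.
print(len(encoded.decode('latin-1')))        # 8 characters
```

The bytes themselves carry no marker saying which reading is correct, which is why the encoding has to come from somewhere else.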

Simeon Visser
  • repr(utf16be) returns what I'm looking for: it seems to know that it's a 2-byte encoding, since it returns \x0623 and not \x06\x23. How is that done? – zfou Dec 05 '13 at 23:46
  • It does not; it is returning `"\x06" + "\x32" + "\x33"` (or `"\x06" + "2" + "3"`). Additionally, if you don't know the encoding, who are you to say that's not the correct interpretation? (You do know in this case because you know the encoding.) A string that is "\x06\x23" would actually be the 16 bits; "\x0623" is 24 bits. – Joran Beasley Dec 05 '13 at 23:49

Use the chardet package to guess the encoding.
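
A minimal sketch of that suggestion, assuming the third-party chardet package is installed (`pip install chardet`); the wrapper name `guess_encoding` is hypothetical. `chardet.detect` takes a byte string and returns a dictionary that includes an `encoding` guess and a `confidence` score. The result is a statistical guess, and short inputs tend to give unreliable answers:

```python
try:
    import chardet  # third-party package, assumed to be installed
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet, or None if chardet is missing."""
    if chardet is None:
        return None
    result = chardet.detect(data)
    # 'encoding' may itself be None when chardet cannot make a guess.
    return result['encoding'], result['confidence']


sample = u'\u0623\u0631\u0646\u0628'.encode('utf-8')
print(guess_encoding(sample))
```

Treat the returned encoding as a hint to verify, not as a fact about the data.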

anijhaw