
I want to get the length of Shift-JIS and UTF-8 strings and then compare them. A string can be mixed, for example "ああ12345678sdfdszzz". I tried strlen(), but it gives different results. mb_strlen() doesn't seem to help either, because this is a mixed string.

For example:

ああ12345678 >> strlen() = 24 chars
ああああああああああああああああ >> strlen() = 48 chars
ああああああああああああああああああ >> strlen() = 54 chars

There seems to be no rule. So what is the best way to calculate string lengths and compare them when the text can be in multiple languages?

emeraldhieu
  • Judging from your examples, each `あ` in your two latter examples is 3 bytes (so it might be UTF-8). But that doesn’t quite correlate with the first example. So how exactly are these strings built? – Gumbo Feb 13 '12 at 07:27
  • That character is Hiragana. I typed using ibus keyboard on ubuntu. I don't know why it's 3 bytes. I think it must be 2 bytes. I wonder whether there is a real rule for this. – emeraldhieu Feb 13 '12 at 07:34

3 Answers


strlen() only counts bytes, so it is only useful for single-byte character encodings. For multi-byte character encodings, use mb_strlen(), which counts the actual characters instead.
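
As a minimal sketch of the difference (assuming the mbstring extension is loaded and the source file is saved as UTF-8; the variable names are only illustrative), the same text can be measured in both encodings:

```php
<?php
// Assumption: mbstring is enabled and this file is saved as UTF-8.
$utf8 = 'ああ12345678';                               // 2 hiragana + 8 ASCII digits
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');  // same text as Shift-JIS bytes

// strlen() counts bytes, so the result depends on the encoding.
$utf8Bytes = strlen($utf8);   // 14: each あ is 3 bytes in UTF-8
$sjisBytes = strlen($sjis);   // 12: each あ is 2 bytes in Shift-JIS

// mb_strlen() counts characters when it is told the correct encoding.
$utf8Chars = mb_strlen($utf8, 'UTF-8');  // 10
$sjisChars = mb_strlen($sjis, 'SJIS');   // 10

echo "$utf8Bytes $sjisBytes $utf8Chars $sjisChars\n";
```

The byte counts differ while the character counts agree, which is what makes the character counts comparable across encodings.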

Gumbo
  • it is a mixed string, what encoding should I pass to mb_strlen? utf8 or sjis? What if they type 5 languages? – emeraldhieu Feb 13 '12 at 07:04
  • Then how are the character encodings mixed? Note that US-ASCII is a proper subset of the UCS and is encoded identically in both US-ASCII and UTF-8. – Gumbo Feb 13 '12 at 07:08
  • I'm sorry but what encoding will you pass to mb_strlen? I mean the second parameter of mb_strlen. – emeraldhieu Feb 13 '12 at 07:15

I would write a function that finds where each particular encoding starts and ends in the string.

Then I would split the string by encoding, run mb_strlen on each part, and sum up the sizes afterwards. Then repeat on the second string and compare.

I guess you understand my point ;)

PS: Use mb_detect_encoding to detect the encoding.

mb_detect_encoding (see the comments there for further ideas from the PHP community)
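
A simplified sketch of this idea follows. It assumes each string as a whole is in a single encoding (it does not split a string into segments) and that only UTF-8 and Shift-JIS are candidates. The function name `detected_length` is hypothetical. Keep in mind that detection is a heuristic: some byte sequences are valid in more than one encoding, so the guess can be wrong.

```php
<?php
// Hypothetical helper: guess the encoding of the whole string, then count characters.
// Assumption: mbstring is enabled; the candidate list is only illustrative.
function detected_length(string $s, array $candidates = ['UTF-8', 'SJIS']): int
{
    // Strict mode rejects candidates in which the bytes are invalid.
    $encoding = mb_detect_encoding($s, $candidates, true);
    if ($encoding === false) {
        throw new InvalidArgumentException('Could not detect the encoding.');
    }
    return mb_strlen($s, $encoding);
}

// Shift-JIS bytes for "ああ12345678"; they are not valid UTF-8, so SJIS is detected.
$sjis = "\x82\xA0\x82\xA012345678";
echo detected_length($sjis), "\n";
```

Calling the helper on both strings and comparing the two integers gives a comparison in characters rather than bytes.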

Oliver M Grech
0
// Read the submitted value (assumed to be UTF-8 encoded).
$field = $_POST['field'];
// Count characters rather than bytes.
$field_length = mb_strlen($field, 'UTF-8');
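
The snippet above assumes the submitted value really is UTF-8. A hedged sketch of guarding that assumption with mb_check_encoding before counting is shown here; `utf8_length` is a hypothetical helper name, and returning null for invalid input is just one possible design choice:

```php
<?php
// Hypothetical helper: character count of a UTF-8 string,
// or null if the bytes are not valid UTF-8.
function utf8_length(string $s): ?int
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return null;
    }
    return mb_strlen($s, 'UTF-8');
}

var_dump(utf8_length('ああ12345678'));  // valid UTF-8 input (file saved as UTF-8)
var_dump(utf8_length("\x82\xA0"));       // Shift-JIS bytes for あ, invalid as UTF-8
```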