How to strlen of a multi-language string

Question

I want to get the strlen() of Shift-JIS and UTF-8 strings and then compare them. A string can be mixed, e.g. "ああ12345678sdfdszzz". I tried strlen(), but it gives different results. mb_strlen() doesn't help either, because this is a mixed string.

For example:

ああ12345678 >> strlen() = 24 chars
ああああああああああああああああ >> strlen() = 48 chars
ああああああああああああああああああ >> strlen() = 54 chars

There seems to be no rule. So what is the best way to calculate string lengths and compare them across multiple languages?

Judging from your examples, the `あ` in your two latter examples are 3 bytes each (might be UTF-8 then). But that doesn’t quite correlate with the first example. So how exactly are these strings built? — Gumbo, Feb 13 '12 at 07:27
That character is Hiragana. I typed it using the IBus keyboard on Ubuntu. I don't know why it's 3 bytes; I thought it would be 2 bytes. I wonder whether there is a real rule for this. — emeraldhieu, Feb 13 '12 at 07:34

score 6 · Accepted Answer · answered Feb 13 '12 at 07:03

strlen() only counts bytes, so it is only useful for single-byte character encodings. For multi-byte character encodings, use mb_strlen(), which counts the actual characters.
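To make the byte/character difference concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below is UTF-8 bytes.
$utf8 = "ああ12345678";                               // 2 hiragana + 8 ASCII digits
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');  // same text as Shift_JIS bytes

// strlen() counts bytes: each あ is 3 bytes in UTF-8 and 2 bytes in Shift_JIS.
$byteLengths = [strlen($utf8), strlen($sjis)];        // [14, 12]

// mb_strlen() counts characters once it is told the right encoding.
$charLengths = [mb_strlen($utf8, 'UTF-8'), mb_strlen($sjis, 'SJIS')];  // [10, 10]

print_r($byteLengths);
print_r($charLengths);
```

The 3-bytes-per-あ figure matches the second example in the question (16 characters, strlen() = 48), which suggests those strings are UTF-8.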

Gumbo

It is a mixed string, so what encoding should I pass to mb_strlen? UTF-8 or SJIS? What if they type in 5 languages? – emeraldhieu Feb 13 '12 at 07:04
Then how are the character encodings mixed? Note that US-ASCII is a proper subset of the UCS and is encoded exactly in both US-ASCII and UTF-8. – Gumbo Feb 13 '12 at 07:08
I'm sorry but what encoding will you pass to mb_strlen? I mean the second parameter of mb_strlen. – emeraldhieu Feb 13 '12 at 07:15

score 2 · Answer 2 · answered Feb 13 '12 at 07:13

I would write a function that finds where each encoding begins and ends in the string.

Then I would split the string by encoding, run mb_strlen on each part, and sum the lengths afterwards. Then repeat for the second string and compare.

I guess you understand my point ;)

PS: Use mb_detect_encoding to detect the encoding (see the comments on its documentation page for further ideas from the PHP community).
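As a rough illustration, the sketch below takes a simpler route than splitting one string into per-encoding segments: it assumes each whole string is in exactly one of a few candidate encodings and tries them in order with mb_check_encoding (swapped in here for mb_detect_encoding because its result is a plain yes/no per encoding). The function name `guess_char_length` and the candidate list are hypothetical, and the guess can be wrong for byte sequences that happen to be valid in more than one encoding.

```php
<?php
// Hypothetical helper: character count of a string in one of the candidate encodings.
// Assumption: the whole string uses a single encoding; candidates are tried in order.
function guess_char_length(string $s, array $candidates = ['UTF-8', 'SJIS']): ?int
{
    foreach ($candidates as $enc) {
        // Accept the first encoding in which the entire byte sequence is valid.
        if (mb_check_encoding($s, $enc)) {
            return mb_strlen($s, $enc);
        }
    }
    return null; // no candidate encoding fits
}

// Usage: both strings hold the same text, so their character counts can be compared.
$a = "ああ12345678";                                // UTF-8 (file saved as UTF-8)
$b = mb_convert_encoding($a, 'SJIS', 'UTF-8');      // same text in Shift_JIS
var_dump(guess_char_length($a) === guess_char_length($b));
```

UTF-8 is tried first because Shift_JIS bytes containing Japanese text are rarely valid UTF-8 by accident, whereas the reverse is more common. ASCII-only input is valid in both encodings and gives the same count either way.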

score 0 · Answer 3 · answered Nov 15 '14 at 14:55

$field = $_POST['field'] ?? '';              // submitted form value, or '' if missing
$field_length = mb_strlen($field, 'UTF-8');  // character count, assuming the input is UTF-8

David Alexandrovich
