Check this snippet:

mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
mb_ereg_search_init('καλημέραCCC', 'C+');
$pos = mb_ereg_search_pos();
echo $pos[0];

(Please don't comment on this specific example, it's not my use case, it's a reduction of the problem I'm having)

Even though the string "καλημέρα" consists of 8 characters, the snippet above prints 16. Am I missing something? Isn't mb_ereg_search_init supposed to support multi-byte strings? And if this is the intended behavior, is there any built-in function that returns the position in characters?

Thanks in advance.

Lea Verou
  • Doesn't a normal `preg_match` with `u` modifier work? – NikiC Feb 05 '11 at 14:33
  • preg_match returns the number of matches (0 or 1, since it stops at first match), not the position of the match in the string :/ – Lea Verou Feb 05 '11 at 14:37
  • I'm conjecturing, but `mb_internal_encoding()` might be set to UCS2, so `_pos()` returns the actual byte offset(?) – mario Feb 05 '11 at 14:38
  • And you could still use `preg_match` with [`$flags=PREG_OFFSET_CAPTURE`](http://php.net/manual/en/function.preg-match.php) - though it's not as nice an API. – mario Feb 05 '11 at 14:40
  • No, it's not the internal encoding (snippet updated). I'll try preg_match with that flag, thanks! – Lea Verou Feb 05 '11 at 14:53
  • No, the offset captured by preg_match is in bytes, even with the u modifier. :( – Lea Verou Feb 05 '11 at 14:57

1 Answer

From the manual page for mb_ereg_search_pos:

An array including the position of a matched part for a multibyte regular expression. The first element of the array will be the beginning of matched part, the second element will be length (bytes) of matched part. It returns FALSE on error.

My interpretation is that it always returns a byte offset, not a character position. If you check more of these multi-byte functions, at least one other hints that it is supposed to work this way. Don't ask me what the purpose of this function is, then...
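
If the byte offset is all you get, one option is to convert it to a character offset afterwards by counting the characters in the bytes that precede the match. This is only a minimal sketch: it assumes the subject is valid UTF-8, that the source file is saved as UTF-8, and that the mbstring regex functions are available. The variable names are made up for illustration.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$subject = 'καλημέραCCC';

mb_ereg_search_init($subject, 'C+');
$pos = mb_ereg_search_pos(); // [byte offset, byte length], or false when nothing matches

$charOffset = null;
$charLength = null;

if ($pos !== false) {
    [$byteOffset, $byteLength] = $pos;

    // substr() works on bytes, mb_strlen() counts characters,
    // so this counts the characters that sit before the match.
    $charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

    // Same idea for the length of the matched part.
    $charLength = mb_strlen(substr($subject, $byteOffset, $byteLength), 'UTF-8');

    echo $charOffset, ' ', $charLength, PHP_EOL; // 8 3
}
```

The conversion costs an extra pass over the prefix of the string, which should be fine for short subjects but may matter for very long ones.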

If you just want to know the position of the first C, you can use mb_strpos:

mb_strpos('καλημέραCCC', 'C'); // 8

If you simply want to hack around it at all costs, there is a workaround. You have to decode the string first:

mb_ereg_search_init(utf8_decode('καλημέραCCC'), 'C+');

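The string becomes ????????CCC: utf8_decode converts to ISO-8859-1, which cannot represent the Greek letters, so each of them turns into a question mark that is exactly 1 byte long and the offsets count properly. However, if you want to use a multi-byte character in the regexp now ('λ+'), it won't work.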
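
A possible alternative that keeps multi-byte patterns working is the preg_match route mentioned in the comments: capture the byte offset with PREG_OFFSET_CAPTURE and convert it to characters. The helper name preg_char_pos below is hypothetical (it is not a built-in), and the sketch assumes valid UTF-8 input and a UTF-8 source file. It also avoids utf8_decode, which is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: returns [character offset, character length] of the
 * first match, or null when the pattern does not match (or fails).
 */
function preg_char_pos(string $pattern, string $subject): ?array
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    [$text, $byteOffset] = $m[0];

    return [
        mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8'), // characters before the match
        mb_strlen($text, 'UTF-8'),                            // characters in the match
    ];
}

var_dump(preg_char_pos('/C+/u', 'καλημέραCCC')); // [8, 3]
var_dump(preg_char_pos('/λ+/u', 'καλημέραCCC')); // [2, 1]
```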

Ondrej Slinták