I have an application that has so far been in English only. Content encoding throughout the templates and database has been UTF-8. I am now looking to internationalize/translate the application into languages whose scripts cannot be represented without UTF-8.

The application uses various PHP string functions such as strlen(), strpos(), substr(), etc., and my understanding is that I should switch these for multi-byte string functions such as mb_strlen(), mb_strpos(), mb_substr(), etc., in order for multi-byte characters to be handled correctly. I've tried to read around this topic a little, but virtually everything I can find goes deep into "encoding theory" and doesn't provide a simple answer to the question: if I'm using UTF-8 throughout, can I switch from using strlen() to mb_strlen() and expect things to work normally in, for example, both English and Arabic, or is there something else I still need to look out for?

Any insight would be welcome, and apologies if I'm offending someone who has encoding close to their heart with my relative ignorance.

Tom

3 Answers

No. Since byte arrays are also strings in PHP, a blanket replacement of the 8-bit string functions with their mb_* counterparts will cause nothing but trouble. Functions like strlen() and substr() are probably used more often on bytes than on actual text strings.
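
To make the bytes-versus-text distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the sample strings are made up for illustration.

```php
<?php
// Text: the Arabic word below has 4 letters, each 2 bytes long in UTF-8.
$text = "سلام";

$byteLength = strlen($text);             // counts bytes: 8
$charLength = mb_strlen($text, 'UTF-8'); // counts characters: 4

// Binary data: the first four bytes of a JPEG file, which are not valid UTF-8.
// Here the byte count is the only meaningful length.
$payload = "\xFF\xD8\xFF\xE0";
$payloadBytes = strlen($payload);        // counts bytes: 4
```

Which of the two lengths is "right" depends on what the string holds, which is why a search-and-replace over function names cannot decide it for you.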

At the place I last worked, we managed to build a multilingual website (Arabic and Hindi, among other languages) in PHP without using the mbstring library at all. Text string manipulation actually doesn't happen that often. When it does, it requires far more care than just changing a function name. Most of the challenges, I've found, lie on the HTML side. Getting a page layout to work with an RTL language is the non-trivial part.
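
As one example of the care involved: offsets from the byte functions and offsets from the mb_* functions are not interchangeable. A rough sketch, assuming mbstring is loaded and every string is valid UTF-8 (the sample text is made up):

```php
<?php
$text = "مرحبا world"; // 5 Arabic letters (2 bytes each), a space, then "world"

// Byte-based pair: strpos() returns a byte offset and substr() consumes one.
$bytePos   = strpos($text, "world");   // 11
$fromBytes = substr($text, $bytePos);  // "world"

// Character-based pair: mb_strpos() returns a character offset.
$charPos   = mb_strpos($text, "world", 0, 'UTF-8');      // 6
$fromChars = mb_substr($text, $charPos, null, 'UTF-8');  // "world"

// Mixing the two families is the bug: a character offset used as a byte
// offset starts in the middle of the Arabic word.
$mixed = substr($text, $charPos);      // not "world"
```

Either pair works on its own, so code that already uses strpos() together with substr() on valid UTF-8 keeps working. Trouble starts when only some of the calls in a chain are converted.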

I don't know if you're just using Arabic as an example. The difficulty of internationalization can vary quite substantially depending on whether "international" means European languages only (plus Russian), or if it's inclusive of Middle-Eastern, South-Asian, and Far-East languages.

cleong
  • Thanks, it sounds like I need to run a few tests to find the best approach for my needs. I will indeed be using Arabic, plus a few other tricky languages. The template RTL stuff doesn't worry me too much as I've worked with it in the past. The thing I'm trying to avoid is having some silly strpos() create unexpected behavior and corrupt data. – Tom Aug 20 '12 at 23:15
  • We had some problems with fixed-width database fields cutting off UTF-8 strings mid-character. That has a way of causing the browser to blow up. PHP itself never was an issue for us. String manipulations require making assumptions about the script. When you're doing a truly multilingual site, you have to handle text strings essentially as opaque objects. If a problem occurs, the fix will no doubt demand a large measure of human judgement and intelligence. – cleong Aug 20 '12 at 23:42
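
The mid-character truncation described in the comment above can be reproduced, and avoided, roughly as follows. This is only a sketch: it assumes mbstring is loaded, and the 5-byte limit stands in for a hypothetical fixed-width column.

```php
<?php
$text = "سلام";   // 8 bytes in UTF-8, 2 per letter
$maxBytes = 5;    // hypothetical column width in bytes

// substr() cuts at exactly 5 bytes, splitting the third letter in half.
$broken = substr($text, 0, $maxBytes);
$brokenIsValid = mb_check_encoding($broken, 'UTF-8'); // false

// mb_strcut() also works in bytes but backs off to a character boundary,
// so the result never exceeds the limit and stays valid UTF-8.
$safe = mb_strcut($text, 0, $maxBytes, 'UTF-8');
$safeIsValid = mb_check_encoding($safe, 'UTF-8');     // true
```

Sizing the column generously or validating length before the insert is still the better fix; truncation of any kind can change the meaning of the text.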

Check the status of the mbstring.func_overload flag in php.ini.

If (ini_get('mbstring.func_overload') & 2) is non-zero, then functions like strlen() (as listed here) are already overloaded by their mb_* counterparts such as mb_strlen(), so there is no need for you to call the mb_* functions explicitly.
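
A small sketch of that check follows. The helper name stringFunctionsOverloaded() is made up for this example. Note that the setting was deprecated in PHP 7.2 and removed in PHP 8.0, where ini_get() returns false for it, hence the cast to int.

```php
<?php
// Hypothetical helper: true when bit 2 (the string-function group) is set
// in an mbstring.func_overload value.
function stringFunctionsOverloaded(int $flags): bool
{
    return ($flags & 2) === 2;
}

// On PHP 8+ the setting does not exist, ini_get() returns false,
// and the cast turns that into 0.
$flags = (int) ini_get('mbstring.func_overload');
$overloaded = stringFunctionsOverloaded($flags);
```

The other bits cover different groups (1 for mail(), 4 for the regular expression functions), so testing for bit 2 rather than for a non-zero value is deliberate.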

Mark Baker
  • I have full control over the server and can easily switch those functions. I'd really just like to know whether I *can* do that without making a mess of things? – Tom Aug 20 '12 at 22:25
  • For the love of your sanity, whatever you do, do NOT enable mbstring.func_overload. At all. The setting is global, and everything that relies on manipulating strings, especially binary data, will fail in ways you cannot imagine, and you'll be a quivering wreck before you get it all sorted out. – A.Grandt Aug 28 '15 at 13:15

The number of multibyte functions you really need is under 10, so consider posting three to five separate questions, each asking whether a specific function usage or piece of logic is correct. This question is obscure and hard to answer. Small questions get quick answers, and concrete questions bring out good answers. Let me know when you post the other questions.

If you need use cases, see the fallback functions in CMSes such as WordPress, MediaWiki, and Drupal.

When you decide to start using mbstring, you should avoid the mbstring.func_overload directive. The mbstring maintainers are going to deprecate mbstring.func_overload in PHP 5.5 or 5.6 (see the PHP core mailing list, April 2012). mbstring.func_overload breaks codebases that were not written with it in mind. You can see such cases in CakePHP and Zend Framework 1.x, which calculate Content-Length using strlen().
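
For code that must count bytes even on a server where the overload is switched on, one common workaround is to ask mbstring for the length in the '8bit' encoding. A minimal sketch; byteLength() is a hypothetical helper name, not something taken from the frameworks mentioned above:

```php
<?php
// Hypothetical helper: returns the length of $data in bytes whether or not
// strlen() has been overloaded by mbstring.func_overload.
function byteLength(string $data): int
{
    if (function_exists('mb_strlen')) {
        // '8bit' treats every byte as one character.
        return mb_strlen($data, '8bit');
    }
    return strlen($data);
}

$body = "سلام";
$contentLength = byteLength($body); // 8, the value a Content-Length header needs
```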

I answered a similar question in another place: Should i refactor all my framework to use mbstring functions?

masakielastic