
Some years ago, I built a good custom PHP CMS site, but I overlooked one important issue: Unicode support. This was primarily because, at the time, the users were English-speaking and that was expected to remain the case for the foreseeable future. Another factor was PHP's poor Unicode support to begin with.

Well, now the day of reckoning has come. I want to support Unicode, specifically UTF-8, but I have one major obstacle: PHP's string functions. Correct me if I'm wrong, but even now, in the world of PHP 5.5, PHP's regular string functions (e.g. strlen, substr, str_replace, strpos, etc.) do not fully support Unicode. On the other hand, PHP's mbstring functions do support Unicode, but I have read that they may be rather resource heavy (which makes sense, since they deal with multibyte characters as opposed to single-byte characters).
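To illustrate the difference, here is a minimal sketch. It assumes the mbstring extension is loaded; the sample word and variable names are only for illustration, and the "ï" is written as a byte escape so the result does not depend on the file's encoding.

```php
<?php
// Assumption: the mbstring extension is available.
// "naïve" in UTF-8: the "ï" is the two bytes 0xC3 0xAF.
$word = "na\xC3\xAFve"; // 5 characters, 6 bytes

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

$byteCut = substr($word, 0, 3);             // cuts through the middle of "ï"
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // keeps "ï" intact

echo $byteLength, "\n"; // 6
echo $charLength, "\n"; // 5
echo $charCut, "\n";    // naï
```

The byte-oriented substr() call returns "na" followed by half of a character, which is exactly the kind of silent corruption I'm worried about.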

So, the way I see it, there are three solutions:

1) Use multibyte string functions in all cases.

A. Try to override the standard string functions with their multibyte counterparts. Speaking of which, were I to do this, what is the best way to do so?

B. Painstakingly go through all my code and replace the standard string functions with their multibyte function counterparts.

2) Painstakingly go through all my code and replace standard string functions that would work with user input, database data, etc with their multibyte function counterparts. This would require me to look at every usage of every string function carefully in my code to determine whether it has even the slightest chance of dealing with multibyte characters.

The benefit of this is that I would have the optimal running time while at the same time fully supporting unicode. The drawback here is that this would be very time-consuming (and extremely boring, I might add) to implement and there would always be the chance I'd miss using a multibyte string function where I should.

3) Overhaul my software entirely and start from scratch. But this is something I'm trying to avoid.

If there are other options available, please let me know.

J Johnson

1 Answer


I'd go for a variation of 1.B:

1.B.2) Use an automated "Search and Replace" function (a single carefully crafted sed command might do it).

Reason for choosing 1 over 2: premature optimization is the root of all evil. I don't know where you read that the mb_ functions are "resource heavy", but frankly, that's utter nonsense. Of course they take a few more CPU cycles, but that is on a scale you really should not worry about. For some reason PHP developers love to discuss micro-optimizations like "are single quotes faster than double quotes" when they should focus on the things that really make a difference (mostly I/O and the database). Really, it's not worth any effort.

Reason for automation: it's possible, it's more efficient, do you need more arguments?
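As a rough sketch of what such an automated replacement could look like, here is a version written in PHP with preg_replace instead of sed. The function name convertToMultibyte and the list of functions are assumptions for illustration only. The list is deliberately explicit, because not every string function has an mb_ counterpart (there is no mb_str_replace, for example). The sketch is purely textual, so it would also rewrite matches inside strings and comments; review the resulting diff before committing.

```php
<?php
// Hypothetical helper (not part of PHP or the original answer): rewrites
// calls to a few byte-oriented string functions into their mb_ counterparts.
function convertToMultibyte($source)
{
    // Only functions that really have an mb_ counterpart are listed.
    $functions = array(
        'strlen', 'substr', 'strpos', 'strrpos',
        'stripos', 'strtolower', 'strtoupper',
    );

    // Match the bare name directly before "(", but skip method calls
    // (->strlen), static calls (::strlen), variables ($strlen) and names
    // that are part of a longer identifier (mb_strlen, my_strlen).
    $pattern = '/(?<![\w$>:])(' . implode('|', $functions) . ')(?=\s*\()/i';

    return preg_replace($pattern, 'mb_$1', $source);
}

// Usage on a small made-up snippet of legacy code:
$before = 'if (strlen($title) > 80) { $title = substr($title, 0, 80); } '
        . '$n = $obj->strlen(); $m = mb_strlen($x);';
$after = convertToMultibyte($before);

echo $after, "\n";
```

The mb_ functions fall back to the internal encoding when none is passed, so this approach assumes something like mb_internal_encoding('UTF-8') is called once during bootstrap.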

Fabian Schmengler
  • Why not simply override the functions then? What's the point of replacing them all when it would be easier to override them? Plus, it seems ugly to have all these mb_ functions around if we are going to make all string functions be multibyte. Better to just have PHP "act" like a professional language (where the default string functions already deal with multibyte characters). – J Johnson Mar 22 '13 at 21:11
  • If you want to "simply" hack the PHP core with all the consequences. – Fabian Schmengler Mar 23 '13 at 08:51
  • Loss of portability and upgradability mainly. To be clear, we're talking about changing the PHP source code and compiling your own version of PHP. – Fabian Schmengler Mar 25 '13 at 21:58