
I have read several things about this topic, but I still have doubts that I would like to share with the community.

I want to add complete UTF-8 support to the application I developed, DaDaBIK. The application can be used with different DBMSs (such as MySQL, PostgreSQL, and SQLite), and the databases can use ANY charset. I can't set or assume the charset.

My approach would be to convert everything I read from the DB to UTF-8 using the iconv functions, and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm always working with UTF-8.

The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even if I use mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can cause problems with UTF-8 and DON'T have an mbstring equivalent, for example the PREG extension, strcspn, trim, ucfirst, ucwords...

Since I'm using some external libraries such as adodb and htmLawed, I can't control all the source code, and those libraries use these functions in several places. Do you have any advice about this? And above all, how do very popular applications like WordPress handle this (IMHO big) problem? I doubt they don't have any "trim" in their code. Do they just take the risk (data corruption, for example), or is there something I can't see?

Thanks a lot.

Eugenio

1 Answer


First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.

It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible: every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character such as a space or a newline. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
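As a rough illustration of the point about trim and str_replace, here is a minimal sketch. It assumes the script file itself is saved as UTF-8; the sample strings and variable names are made up for the example.

```php
<?php
// Assumption: this source file is saved as UTF-8, so the literals below
// are UTF-8 byte sequences ("é" is 2 bytes, each CJK character is 3 bytes).
$padded = "  \t漢字 café\n";

// trim() with its default character list only strips ASCII whitespace bytes,
// none of which can occur inside a UTF-8 multi-byte sequence.
$trimmed = trim($padded);

// str_replace() searches for the complete byte sequence of "é",
// so it cannot match in the middle of another character.
$replaced = str_replace('é', 'e', $trimmed);

var_dump($trimmed === '漢字 café');   // bool(true)
var_dump($replaced === '漢字 cafe');  // bool(true)
```

Both calls leave every multi-byte character intact, because neither the default whitespace list nor the full search string can collide with part of another character.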

The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
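The difference between byte offsets and character offsets can be sketched as follows. This assumes the mbstring extension is loaded and the file is saved as UTF-8; the sample word is arbitrary.

```php
<?php
// Assumption: UTF-8 source file, mbstring extension available.
$word = 'naïve'; // 5 characters, 6 bytes ("ï" is the two bytes 0xC3 0xAF)

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

// Cutting at byte offset 3 splits "ï" in half and leaves a dangling 0xC3.
$byteCut = substr($word, 0, 3);

// Cutting at character offset 3 keeps "ï" whole.
$charCut = mb_substr($word, 0, 3, 'UTF-8');

$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');

var_dump($byteLength);      // int(6)
var_dump($charLength);      // int(5)
var_dump($charCut);         // string(4) "naï"
var_dump($byteCutIsValid);  // bool(false)
```

The substr result is no longer valid UTF-8, which is the kind of corruption an offset-based function can introduce when it is handed character counts.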

preg_ supports UTF-8 just fine using the /u modifier.
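A short sketch of what the /u modifier changes, assuming a UTF-8 source file and a reasonably current PHP build; the sample string is the one used in the comments below.

```php
<?php
// Assumption: UTF-8 source file.
$text = '漢字!!';

// Without /u the pattern is applied to raw bytes, so "." matches
// only the first byte of the three-byte character "漢".
preg_match('/^./', $text, $byteMatch);

// With /u the subject is treated as UTF-8, so "." matches a whole character.
preg_match('/^./u', $text, $charMatch);

// With /u, \w also matches letters outside ASCII.
preg_match('/\w+/u', $text, $wordMatch);

// A common idiom for splitting a UTF-8 string into characters.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

var_dump(strlen($byteMatch[0]));  // int(1)
var_dump($charMatch[0]);          // string(3) "漢"
var_dump($wordMatch[0]);          // string(6) "漢字"
var_dump(count($chars));          // int(4)
```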

If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.

See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" for a more in-depth discussion and demystification of PHP and character sets.

Further, it does not matter what the strings are encoded as in the database. You can set the connection encoding for the database, which will cause it to convert everything for you and always return you data in the desired client encoding. No need for iconverting everything in PHP.
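How the connection encoding is requested differs per DBMS. The sketch below is only an illustration: `utf8ConnectionSettings` is a hypothetical helper (it is not part of PDO, ADOdb, or DaDaBIK), and the choice of PDO-style DSNs is an assumption. It relies on two things that do exist: the `charset` parameter of the PDO MySQL DSN and PostgreSQL's `SET client_encoding` statement. SQLite stores text as Unicode already, so the sketch requests nothing for it.

```php
<?php
/**
 * Hypothetical helper: returns what to append to a PDO DSN and which SQL
 * statement (if any) to run right after connecting, so that the driver
 * delivers and accepts UTF-8 regardless of the charset used for storage.
 */
function utf8ConnectionSettings(string $driver): array
{
    switch ($driver) {
        case 'mysql':
            // PDO MySQL accepts a charset parameter in the DSN;
            // utf8mb4 is MySQL's name for full UTF-8.
            return ['dsn_suffix' => ';charset=utf8mb4', 'init_sql' => null];
        case 'pgsql':
            // PostgreSQL converts between server and client encoding
            // once the client encoding has been set.
            return ['dsn_suffix' => '', 'init_sql' => "SET client_encoding TO 'UTF8'"];
        case 'sqlite':
            // Assumption: the text in the file was written as valid Unicode.
            return ['dsn_suffix' => '', 'init_sql' => null];
        default:
            throw new InvalidArgumentException("Unsupported driver: $driver");
    }
}

$settings = utf8ConnectionSettings('pgsql');
var_dump($settings['init_sql']); // string(29) "SET client_encoding TO 'UTF8'"
```

With PDO, the suffix would be appended to the DSN before calling `new PDO(...)`, and a non-null `init_sql` would be passed to `$pdo->exec()` once, immediately after connecting. Whether every DBMS an application supports can convert from its storage charset is something to verify per database.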

deceze
  • Time to give you a BUMMER! I did NOT ask this question. I was just a passer-by who came to see who has this freakin' level of patience to reply here ;) – Keval Domadia Sep 05 '12 at 12:14
  • *OUCH!* Sorry 'bout that. You two look alike. X-D – deceze Sep 05 '12 at 12:15
  • Now I'm confused though, since there's another "Passerby" in the comments to the question... O_o;;? – deceze Sep 05 '12 at 12:20
  • I just checked your blog via your profile. Awesome stuff. Write a book. -___- – Keval Domadia Sep 05 '12 at 12:30
  • Yes, I will, as long as it's on Amazon :) – Keval Domadia Sep 05 '12 at 12:32
  • Hehe. Let's see, maybe one day down the line... :) – deceze Sep 05 '12 at 12:33
  • There are quite a few "basics" which we developers have missed out on! You pointed them all out (at least all that I could think of). I have got your blog bookmarked! – Keval Domadia Sep 05 '12 at 12:33
  • Of course I was referring to string functions; I thought it was clear :) substr: it wasn't even in the list I mentioned, there is an mbstring equivalent so of course there is no problem. Trim: if in the 2nd parameter there is a multibyte char there is a risk, it is pretty evident. PREG: the page I mentioned explains pretty well the risks, if you think that explanation has some flaws, you should propose counter arguments. And there are risks also with strcspn,ucfirst,ucwords. Finally are you sure your last sentence is true for all popular DBMS? I admit I am not sure, I will investigate it. – Eugenio Sep 05 '12 at 13:23
  • @Eugenio I can't take that page too seriously. Example: *"`htmlentities`, Risk: high, [...] Rumour - although this function (claims) to have UTF-8 support, bug reports claim it’s broken at least until PHP 5. Using it on a UTF-8 string with the wrong charset would, very likely, result in corruption / junk output."* So what is it? **HIGH RISK**? Rumor? Buggy in age-old versions? Only if used incorrectly? Well sure, that's my point, if used incorrectly almost anything is "dangerous". There are also no concrete test cases for any of the assertions on that page. Take it all with a grain of salt. – deceze Sep 05 '12 at 13:32
  • @Eugenio Also, some things seem plain wrong or outdated. According to that page, `preg_match('/\w+/u', '漢字!!')` should not work. Well, it does though: http://codepad.viper-7.com/GKIpxG – deceze Sep 05 '12 at 13:37
  • @deceze I actually think that page is very well done, that's why it is still one of the most linked sources when it comes to multibyte and PHP. Again, for PCRE functions and the others I mentioned, the explanation makes sense, so you can't just say it is not true without proposing counter arguments :) – Eugenio Sep 05 '12 at 13:44
  • @Eugenio Please talk specifics. What is the risk? There are three points regarding PCRE: 1) the `/i` modifier, which says *"Unless the /u modifier is used..."* as its only risk, which, again, comes down to *using it properly*. 2) The `/u` modifier itself, which only lists a real non-issue as issue. 3) The `\w` and `\b` classes, which I demonstrated to be an incorrect "issue". – deceze Sep 05 '12 at 13:48
  • @Eugenio Also note that that page hasn't been substantially updated since 2006! If there have been any bugs, they have most likely all been fixed since. For most other functions they list, it pretty much comes down to "understand what this does and use it properly", which is exactly what I'm saying. – deceze Sep 05 '12 at 13:50
  • So... `strcspn` is a UTF8 safe function? Can I use it in a UTF8 context? – Peter Krauss Oct 12 '14 at 17:38
  • @PeterKrauss I believe it should be, there's nothing that inherently requires this function to *understand* encodings. Be aware however that the return value will be counted in *bytes*, not *characters*. – deceze Oct 13 '14 at 08:28
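Two points from the comments above can be checked with a small sketch: strcspn returns a byte count, and a multi-byte character in trim's second parameter is treated as a list of individual bytes. This assumes a UTF-8 source file and the mbstring extension; the sample strings are made up for the example.

```php
<?php
// Assumption: UTF-8 source file, mbstring extension available.
$text = '漢字!!';

// strcspn() finds the right position, but reports it in bytes.
$byteCount = strcspn($text, '!');
$prefix    = substr($text, 0, $byteCount);   // safe: offset lies on a character boundary
$charCount = mb_strlen($prefix, 'UTF-8');

// trim() reads its second parameter byte by byte. "à" is 0xC3 0xA0 and
// "Ã" is 0xC3 0x83, so the shared lead byte 0xC3 is stripped from "Ã".
$damaged = trim('Ãbc', 'à');

// One possible alternative: a /u regex that only removes whole "à" characters.
$intact = preg_replace('/^à+|à+$/u', '', 'Ãbc');

var_dump($byteCount);                            // int(6)
var_dump($charCount);                            // int(2)
var_dump(mb_check_encoding($damaged, 'UTF-8'));  // bool(false)
var_dump($intact === 'Ãbc');                     // bool(true)
```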