1

If that's relevant (it very well could be), they are PHP source code files.

JasonMArcher
julien_c
  • Slightly off-topic: any serious project should store all data (including UI texts) in some DB rather than have it hard-coded in source code files. If you follow this, then only code comments might require UTF-8. – binaryLV Apr 05 '11 at 14:20
  • For localization. Even if localization is not needed now, it might be in the future. – binaryLV Apr 06 '11 at 06:11
  • @binaryLV Well, yeah, that's the thing, these PHP files are for localization, actually. – julien_c Apr 06 '11 at 09:32

4 Answers

7

There are a few pitfalls to take care of:

  1. PHP is not aware of the BOM (byte order mark) that certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates that the file is UTF-8, but it is not necessary, and it is invisible. Because it sits before the opening <?php tag, PHP outputs it to the browser like any other text, which can cause "headers already sent" warnings from functions that deal with HTTP headers and prevent you from sending any header. Make sure your text editor saves files as UTF-8 (No BOM); if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
  2. Default string functions are not aware of multibyte encodings. This means that strlen returns the number of bytes in the string, not the number of characters. This isn't too much of a problem until you start slicing strings of non-ASCII characters with functions like substr: when you do, the indices you pass refer to byte offsets rather than character offsets, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will output an invalid UTF-8 sequence because in UTF-8, é takes two bytes and substr returns only the first one. (The solution is to use the mb_ string functions, such as mb_strlen and mb_substr, which are aware of multibyte encodings.)
  3. You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP performs no automatic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify the encoding in which you expect the result: SET NAMES utf8, SET CHARACTER SET utf8, or something along these lines), or, if you can't find a better way, mb_convert_encoding or iconv will convert a string from one encoding into another.
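
The three pitfalls above can be sketched in a few lines of PHP. This is only an illustration, assuming the mbstring extension is enabled; strip_bom() is a hypothetical helper name chosen for this example, not a PHP built-in, and the sample strings are written as byte escapes so the snippet behaves the same regardless of how the file itself is saved.

```php
<?php
// Pitfall 1: detect and strip a leading UTF-8 BOM (bytes EF BB BF).
// strip_bom() is a hypothetical helper for this sketch, not a PHP built-in.
function strip_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Pitfall 2: byte-oriented vs. multibyte-aware string functions.
$e = "\xC3\xA9";                          // "é" in UTF-8: two bytes
$byteLength = strlen($e);                 // counts bytes: 2
$charLength = mb_strlen($e, 'UTF-8');     // counts characters: 1
$broken = substr($e, 0, 1);               // first byte only, not valid UTF-8
$intact = mb_substr($e, 0, 1, 'UTF-8');   // first character, the whole "é"

// Pitfall 3: convert data that arrives in another encoding into UTF-8.
$latin1 = "\xE9";                         // "é" in ISO-8859-1: one byte
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($broken, 'UTF-8')); // is the sliced string valid UTF-8?
var_dump($intact === $e, $converted === $e);
var_dump(strip_bom("\xEF\xBB\xBFhello"));
```

Stripping the BOM at run time like this only helps for data you read in (for example with file_get_contents); for the PHP source files themselves, the fix is still to save them without a BOM.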
zneak
  • 1
    Good answer. Just wanted to add that there are "multibyte substitutes" for string functions, e.g., `mb_strlen()` and `mb_substr()`. – binaryLV Apr 05 '11 at 14:06
  • +1 nice typing speed:) (and nice answer too). I was just editing my answer to add details about BOM and `mb_*` functions when I saw your answer:). – Slava Apr 05 '11 at 14:16
1

It's usually recommended that you keep all sources in UTF-8. It won't affect the size of regular code written with Latin characters at all, but it will prevent glitches with any special characters.

Slava
0

If you are using any special chars, e.g. in string values, the file size is a little bit bigger, but that shouldn't matter.

Nevertheless, my suggestion is to always leave the default format. I have lost many hours because an error occurred when saving in a different format and all the characters changed.

From a technical point of view, there isn't a difference!

Stefan
  • Since the default varies from editor to editor (some of which pull it from the environment, which again varies), leaving it as the default is a pretty bad idea. Far better to pick an encoding and then make sure everything uses it. – Quentin Apr 05 '11 at 14:06
-1

Very relevant: the PHP parser may start to output spurious characters, like a funky upside-down question mark. Just stick to the norm, much preferred.

Richard Dickinson
  • -1, UTF-8 should be the norm, even in PHP. You only need to use multibyte-aware functions when dealing with strings to avoid the funky question mark characters. – zneak Apr 05 '11 at 14:05
  • 2
    UTF-8 is the norm. The alternatives are UTF-16 (Better for Asian), UTF-32 (Err, no), ISO-8859 (Legacy), ASCII (Limited) and proprietary stuff. – Quentin Apr 05 '11 at 14:05
  • Can I ask why I have been marked down? My answer is true: go ahead and encode your PHP documents in Unicode, you'll not get any output, which validates my statement. Further to that, ANSI is the normal in all text editors unless the user has changed the settings; this is across the board for C#/C++/VB/PHP/JS and so many more. – Richard Dickinson Apr 05 '11 at 14:31
  • @Richard Dickinson, ANSI might be *default*, but not *the normal*. – binaryLV Apr 05 '11 at 14:36
  • 1
    ANSI might be the default on Windows, but UTF-8 is the default on Mac OS and Linux. I, for one, use Mac OS, and all my scripts (with lots of French accented letters) are encoded in UTF-8, and they run properly, not only on my machine, but also on the many web servers I use and have used. There _are_, however, a few pitfalls; I've listed them in my answer. – zneak Apr 05 '11 at 14:58
  • Okay, maybe so; however, let's not get heated over this. My answer was only reinforcing the fact that what you choose is important; as I said, Unicode doesn't work on PHP documents. – Richard Dickinson Apr 05 '11 at 15:30
  • 3
    My definition of "PHP document" is a document that contains PHP code ready to be executed by the PHP script engine. I encode those documents as UTF-8, and not only do they work on my machine, but they also work portably across all the web servers I use, with non-English characters. Do you mean some other kind of document? – zneak Apr 05 '11 at 17:44