PHP cannot find way to split UTF-8 strings

Question

I just started dabbling in PHP, and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.

I'm working on Ubuntu 11.10 x86 with PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I read using

$file = fopen("file.txt", "r");
// fgets() returns false at end of file, so test its result instead of feof()
while (($line = fgets($file)) !== false) {
    //...
}
fclose($file);

Calling mb_detect_encoding($line) reports UTF-8.
If I echo $line, I can see the line properly (no mangled characters) in the browser, so I guess everything is fine with the browser and Apache. I did, though, search my Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for character encoding (just in case).

When I try to split the string using $arr = mb_split(';', $line), the fields of the resulting array contain mangled UTF-8 characters (mb_detect_encoding($arr[0]) reports UTF-8 as well).

So echo $arr[0] will result in something like this: ï»¿Î‘Î˜Î—ÎÎ.

I have tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried to detect UTF-8 manually using this W3 Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.

So my question is: how can I properly split the string? Is going down the mb_ path the wrong way? What am I missing?

Thank you for your help!

UPDATE: I'm adding sample strings and their base64 equivalents (thanks to @chris for his suggestion)

1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ï»¿Î‘Î˜Î—ÎÎ‘"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="

OK, so after doing this there seems to be a 77u/ difference between 3. and 5., which according to this is a UTF-8 BOM. So how can I avoid it?

UPDATE 2: I woke up refreshed today, and with your tips in mind I tried again. It seems that $line = fgets($file) reads the first line correctly (no mangled chars) and fails for each subsequent line. So I base64-encoded the first and second lines, and the 77u/ BOM appeared in the base64 string of the first line only. I then opened the offending file in vim and entered :set nobomb and :w to save the file without the BOM. Firing up PHP again showed that the first line was now mangled as well. Based on @hakre's remove_utf8_bom, I added its complementary function

function add_utf8_bom($str) {
    $bom = "\xEF\xBB\xBF";
    // Prepend the BOM only if the string does not already start with one
    return substr($str, 0, 3) === $bom ? $str : $bom . $str;
}

and voilà, each line is read correctly now.

I do not much like this solution, as it seems very hackish (I can't believe that an entire framework/language does not provide a way to deal with nobombed strings). So do you know of an alternative approach? Otherwise I'll proceed with the above.

Thanks to @chris, @hakre and @jakob for their time!

UPDATE 3 (solution): It turns out it was a browser thing after all: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. The output also had to be properly enclosed inside an <html><body> section, or the browser would not understand the encoding correctly. Thanks to @jakob for his suggestion.

Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.

I recommend you post sample strings(before and after the split) for people to inspect. To preserve them binary safe, base64_encode() them, otherwise the fine details won't be preserved through the web browsers and stackoverflow etc... — goat, Dec 03 '11 at 18:39
@chris +1 it seems that with base64 you might be on to something — bottlenecked, Dec 03 '11 at 19:36
Something is really odd here. I always use UTF8 strings without BOM in PHP and it works without any issues. How do you output the variables? do you just do `echo $line`? Are you outputting a whole webpage, i.e. with doctype, header, etc? Or are you using PHP on the command line? — Jakob Egger, Dec 04 '11 at 11:07
@jakob i use a test.php file in a standalone website (ie no wordpress environment or the like is loaded) that is served with apache2, which i then browse to with firefox. I just do echo $line as you say, and then i progressively tried with meta tags and header() and whatnot to declare utf-8 encoding, in hopes that it was something like this, nothing though. I don't contest that the problem lies somewhere in what i do, i just can't tell what it is! — bottlenecked, Dec 04 '11 at 11:29
Verify the utf-8 http header gets sent to the browser. Use firebug or other firefox addons to check. — goat, Dec 04 '11 at 15:40
@bottlenecked: I don't know if you are doing it already, but try to output valid HTML in your test.php file, i.e. before you write `echo $line`, write something like `echo ' Test Page';`. — Jakob Egger, Dec 04 '11 at 15:54
Another idea: Most browsers allow you to choose the encoding for the current page. Try manually selecting UTF8, and see if that helps. — Jakob Egger, Dec 04 '11 at 15:56
@jakob, surrounding the php code/echo statements with doctype/html/body seemed to do the trick. I will write an update with the solution. Could you incorporate your suggestion in your answer, because i can't mark comments as accepted answer? — bottlenecked, Dec 04 '11 at 16:28

score 4 · Answer 1 · answered Dec 03 '11 at 22:32

UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that:

- ASCII characters stay the same when encoded to UTF-8
- no other characters will be encoded to ASCII characters

This means that when you try to split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use standard single byte string functions.

In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.

PS: Since the UTF-8 encoding is prefix-free and self-synchronizing (the bytes of one character never occur inside the encoding of another), you can actually use explode() with any UTF-8 encoded separator.
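
As a minimal sketch of the point above, here is explode() applied to the sample string from the question, plus a second call with a multi-byte separator. It assumes the script file itself is saved as UTF-8; the "·" separator (U+00B7) is only a hypothetical choice for illustration.

```php
<?php
// Sample line from the question: UTF-8 text with ASCII semicolons as separators.
$line = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889";

// explode() compares raw bytes, so no mbstring function is needed here.
$fields = explode(';', $line);

// A multi-byte separator works as well ("·" is a hypothetical example).
$parts = explode('·', "ΑΘΗΝΑ·ΑΙΓΑΛΕΩ");

print_r($fields);
print_r($parts);
```

The fields come back byte-for-byte intact, so any mangling seen afterwards is more likely to come from how the output is displayed than from the split itself.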

PPS: It seems like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
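
A rough sketch of the fgetcsv() route, assuming a semicolon-separated file like the one in the question. The temporary file is only there to keep the example self-contained; in practice you would open your own file.txt.

```php
<?php
// Self-contained sample: write the question's sample line to a temp file.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889\n");

$rows = [];
$handle = fopen($path, 'r');
// Arguments: stream, max line length (0 = unlimited), separator, enclosure, escape.
while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);
unlink($path);

print_r($rows);
```

Note that if the file starts with a BOM, those three bytes will end up at the start of the first field of the first row, so they would still need to be stripped separately.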

answered Dec 03 '11 at 22:32

Jakob Egger

indeed, explode was what I used at first, and when I couldn't get it to work it later led me to read about mbstrings – bottlenecked Dec 04 '11 at 09:44
Then your problem might be that the output encoding of the html page is not UTF-8. Check if you have `` somewhere in the page header! – Jakob Egger Dec 04 '11 at 09:58
I tried that (it's mentioned somewhere in the overlong problem statement too) but again nada. I also updated the question with new findings again. – bottlenecked Dec 04 '11 at 10:47

hakre · Answer 2 · 2011-12-03T21:34:45.023

The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding:

mb_regex_encoding('UTF-8');

About mb_detect_encoding: it can fail, but that's simply because an encoding can never be detected with certainty. You either know it or you can guess, but that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter with that function and specify the encoding(s) you're looking for.
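
To illustrate the strict parameter, here is a small sketch. The candidate list and the sample strings are assumptions chosen for the example, and mb_check_encoding() is shown alongside because it answers the narrower question "are these bytes valid UTF-8?" rather than guessing.

```php
<?php
$utf8   = "ΑΘΗΝΑ"; // valid UTF-8, assuming this script is saved as UTF-8
$broken = "\xCE";  // a lone lead byte is not valid UTF-8

// Third argument true = strict mode; the array lists the encodings to consider.
$detected = mb_detect_encoding($utf8, ['UTF-8', 'ISO-8859-1'], true);

// Validation instead of detection: true only if the bytes are well-formed UTF-8.
$isValid  = mb_check_encoding($utf8, 'UTF-8');
$isBroken = !mb_check_encoding($broken, 'UTF-8');

var_dump($detected, $isValid, $isBroken);
```

Neither call can tell you how the browser will later interpret the bytes; they only describe the string as PHP sees it.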

How to remove the BOM:

You can filter the string input and remove a UTF-8 BOM with a small helper function:

/**
 * remove UTF-8 BOM if string has it at the beginning
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
   if (substr($str, 0, 3) === "\xEF\xBB\xBF")
   {
       $str = substr($str, 3);
   }
   return $str;
}

Usage:

$line = remove_utf8_bom($line);

There are probably better ways to do it, but this should work.
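
Since a BOM can only appear at the very start of a file, one possible variation is to strip it from the first line only while reading. The following is a sketch under that assumption; the temporary file and the second sample row ("ΠΑΤΡΑ;ΡΙΟ") are hypothetical and only make the example self-contained.

```php
<?php
// Self-contained sample: a UTF-8 file that starts with a BOM.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFΑΘΗΝΑ;ΑΙΓΑΛΕΩ\nΠΑΤΡΑ;ΡΙΟ\n");

$rows = [];
$isFirstLine = true;
$file = fopen($path, 'r');
while (($line = fgets($file)) !== false) {
    if ($isFirstLine) {
        // Only the first line of the file can carry the BOM.
        if (substr($line, 0, 3) === "\xEF\xBB\xBF") {
            $line = substr($line, 3);
        }
        $isFirstLine = false;
    }
    // Drop the trailing newline before splitting on the ASCII separator.
    $rows[] = explode(';', rtrim($line, "\r\n"));
}
fclose($file);
unlink($path);

print_r($rows);
```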

edited Dec 03 '11 at 21:34

answered Dec 03 '11 at 17:43

hakre

I have no problems with your string, actually even a simple explode should work with an UTF-8 encoded string. See http://codepad.viper-7.com/eODqA5 - Looks like you view the result as ISO-8859-*. – hakre Dec 03 '11 at 21:30
using the add_utf8_bom, explode works as expected for each line. If a better (ie less hackish) solution does not come up i will accept this answer – bottlenecked Dec 04 '11 at 10:49
The less hacky way is to save `file.txt` w/o BOM. That's what's suggested first for such problems, see http://unicode.org/faq/utf_bom.html#BOM . Also learn what you need to do in vim to remove the BOM if the file already contains one. `mb_split` works fine in my eyes, as it should preserve the BOM as it's a valid unicode code-point as well: http://www.fileformat.info/info/unicode/char/feff/index.htm - so you better give your application the string that's correctly encoded firsthand or you fix this before parsing or you just continue to use the hack ;) – hakre Dec 04 '11 at 12:28

goat · Answer 3 · 2011-12-03T20:40:36.230

Edit, I just read your post closer. You're suggesting this should output false, because you're suggesting a BOM was introduced by mb_split().

header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

$pieces = mb_split(';', $str);

// "ΑΘΗΝΑ" is 5 characters of 2 bytes each, hence the first 10 bytes
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);

Does it? It works as expected for me (bool true, and the strings in the array are correct).

edited Dec 03 '11 at 20:40

answered Dec 03 '11 at 20:04

goat

yes, it is working just as you say. The problem seems to come up when reading the same line from the file itself – bottlenecked Dec 04 '11 at 10:24
Are you sure you didn't goof when posting the base64_encoded strings? Because the orig base64 string doesn't have a BOM, and I assume it was supposed to be the value returned directly from fgets, first line too. – goat Dec 04 '11 at 15:42
yep. goofed. It was a "manually copy line from editor then paste into php file as argument to base64_encode" kind of thing, since i didn't at the moment understand full implications of this. Sorry for the red herring :( – bottlenecked Dec 04 '11 at 16:18

score 1 · Accepted Answer · answered Dec 04 '11 at 17:35

When you write debug/testing scripts in PHP, make sure you output a more or less valid HTML page.

I like to use a PHP file similar to the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
  </head>
  <body>
     <h1>Test Page</h1>
     <pre><?php
        echo print_r($_GET,1);
     ?></pre>
  </body>
</html>

If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
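
To see what such a misinterpretation does to the bytes, here is a small sketch that re-encodes a UTF-8 string as if it had been Latin-1 (ISO-8859-1). It only illustrates the assumed mechanism; which fallback encoding a given browser actually picks is not something this code can confirm.

```php
<?php
$utf8 = "Α"; // GREEK CAPITAL LETTER ALPHA, bytes CE 91 in UTF-8

// Treat the two bytes as two separate ISO-8859-1 characters and re-encode as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // ce91
echo bin2hex($misread), "\n"; // c38ec291
```

One character has turned into two (the first of which is "Î"), which matches the pattern of the mangled output shown in the question.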

guess that was it! I'm wiser now :P – bottlenecked Dec 04 '11 at 21:21
