Working with UTF-8 encoded text

Question

I have a problem. I need to find some UTF-8 characters in my text file and output them, but instead of the letters it outputs "?" (question marks)...

ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!"; 
echo $homepage[14];

So here is the strange part: if I use an existing index it outputs nothing, but if I put

echo $homepage[35];

it outputs "?", but my $homepage string is only 30 characters long. What's wrong?

It is very strange: it reads the string from the file correctly and outputs it correctly, but when I ask for a character by index, it doesn't work. Here is what's in my text file: advhasgdvgv олыолоываи ouhh

It is output correctly when I just echo $homepage, but $homepage[14] doesn't work. Here is the output:

advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!

Wouldn't that be because Unicode characters are stored in more than 1 byte, so accessing a character like that would only get the first byte? — Supericy, Feb 04 '13 at 19:49

score 0 · Answer 1 · answered Feb 04 '13 at 19:29

Try mb_convert_encoding, and see if that fixes the problem.

http://www.php.net/manual/en/function.mb-convert-encoding.php

string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )

$homepage = mb_convert_encoding(
    file_get_contents('t1.txt'),
    "UTF-8"
);

You should also check on the encodings of both the PHP file and the text file you have there.
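As a rough sketch of what such a check could look like: the snippet below tests whether the data is valid UTF-8 and removes a leading byte order mark (BOM), which some editors write at the start of UTF-8 files. A BOM counts as three extra bytes when indexing the string, and it shows up as ï»¿ when the data is displayed as a single-byte encoding. The helper name stripUtf8Bom and the sample string are made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not from the original answer): removes a leading
// UTF-8 byte order mark (bytes EF BB BF) if the string starts with one.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Stand-in for file_get_contents('t1.txt'): a BOM followed by UTF-8 text.
// Assumes this script itself is saved as UTF-8.
$raw = "\xEF\xBB\xBF" . "advhasgdvgv олыолоываи ouhh";

// true if the bytes form valid UTF-8, false otherwise
var_dump(mb_check_encoding($raw, 'UTF-8'));

$clean = stripUtf8Bom($raw);

// Number of bytes removed from the front (3 when a BOM was present)
echo strlen($raw) - strlen($clean), " byte(s) of BOM removed\n";
```

If the check returns false, the file is probably stored in another encoding, and only then would converting with an explicit source encoding make sense.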

Anh Nhan Nguyen

Now, it's even more strange.. it outputs: ï»¿advhasgdvgv Ð¾Ð»ÑÐ¾Ð»Ð¾ÑÐ²Ð°Ð¸ ouhh !!!!!!!!!!!!g – Hurrem Feb 04 '13 at 19:41
You can then usually just use `echo` or `print` as normal – Anh Nhan Nguyen Feb 05 '13 at 17:20

score 0 · Answer 2 · answered Feb 04 '13 at 19:53

I used this approach for dealing with UTF-8:

<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string, 0, 1); // ئ
// without mb_internal_encoding the return was Ø
echo $mystring;
?>

I also saved all files with UTF-8 encoding.

score 0 · Answer 3 · answered Feb 04 '13 at 20:04

In UTF-8, non-ASCII characters take more than one byte each, so to access one through byte indexes you would have to do:

echo $homepage[30] . $homepage[31];
> и

That assumes the character is exactly 2 bytes long, but it could be longer, so a more general solution would be:

function charAt($str, $pos, $encoding = "UTF-8")
{
    return mb_substr($str, $pos, 1, $encoding);
}
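A small usage sketch, assuming the text from the question is stored as UTF-8 without a BOM and that the mbstring extension is available. It contrasts the byte count with the character count and then fetches characters by character position rather than byte position. The charAt function is repeated so the snippet runs on its own.

```php
<?php
function charAt($str, $pos, $encoding = "UTF-8")
{
    return mb_substr($str, $pos, 1, $encoding);
}

// Sample text from the question; assumes this script is saved as UTF-8.
$text = "advhasgdvgv олыолоываи ouhh";

echo strlen($text), "\n";             // byte count: 37
echo mb_strlen($text, "UTF-8"), "\n"; // character count: 27

echo charAt($text, 12), "\n";         // first Cyrillic letter: о
echo charAt($text, 21), "\n";         // last Cyrillic letter: и
```

Character 21 here is the same letter that the two-byte version builds from bytes 30 and 31.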

score 0 · Answer 4 · answered Feb 04 '13 at 20:11

PHP does not really support UTF-8 in strings: a string is a sequence of bytes, so $text[n] gets the nth byte rather than the nth character. A UTF-8 character takes 1 to 4 bytes, so you cannot reach a character by index directly, because you don't know at which byte it starts, and a single $text[n] can only ever return one byte of a multi-byte character.

Depending on what you want, you can either convert the string to ISO-8859-1 using utf8_decode(), which only works for characters that exist in that charset (the Cyrillic letters in the question do not), or use a UTF-8-aware mechanism to walk through the string from the beginning and extract the characters you need.

Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.
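One possible UTF-8-aware mechanism, sketched here as an illustration rather than as the only way to do it: split the string into an array of characters once, then index into that array. The helper name utf8Chars is made up. It relies on preg_split() with the /u modifier, which makes PCRE treat the subject as UTF-8.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into an array of characters.
function utf8Chars(string $text): array
{
    // An empty pattern with /u matches between every pair of characters;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $chars === false ? [] : $chars;
}

// Sample text from the question; assumes this script is saved as UTF-8.
$chars = utf8Chars("advhasgdvgv олыолоываи ouhh");

echo count($chars), "\n"; // 27 characters
echo $chars[12], "\n";    // о
```

On PHP 7.4 and later, mb_str_split() gives a similar result without a regular expression.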
