I have a problem. I need to find some UTF-8 characters in my text file and output them, but it doesn't output the letters; instead it outputs "?" (question marks).

ini_set( 'default_charset', 'UTF-8' );
$homepage = file_get_contents('t1.txt');
echo $homepage;
echo "\t";
echo "\t!!!!!!!!!!!!"; 
echo $homepage[14];

So here is the strange part: if I use an existing index it outputs nothing, but if I put

echo $homepage[35];

it outputs "?", but my $homepage string is only 30 characters long. What's wrong?

It is very strange: it reads the string from the file correctly and outputs it correctly, but when I ask for a character by index, it doesn't work. Here is what's in my text file: advhasgdvgv олыолоываи ouhh

It outputs correctly when I just echo $homepage, but $homepage[14] doesn't work. Here is the output:

advhasgdvgv олыолоываи ouhh !!!!!!!!!!!!

Hurrem
  • Wouldn't that be because Unicode characters are stored in more than 1 byte, so accessing a character like that would only get the first byte? – Supericy Feb 04 '13 at 19:49

4 Answers

Try mb_convert_encoding, and see if that fixes the problem.

http://www.php.net/manual/en/function.mb-convert-encoding.php

string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )

$homepage = mb_convert_encoding(
    file_get_contents('t1.txt'),
    "UTF-8"
);

You should also check on the encodings of both the PHP file and the text file you have there.
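One way to check the text file's encoding is sketched below. The helper name inspectEncoding is hypothetical (it is not from this thread), and the sample byte strings stand in for the real file contents; in practice you would pass it file_get_contents('t1.txt').

```php
<?php
// Hypothetical helper: report basic encoding facts about raw file contents.
function inspectEncoding($bytes)
{
    return array(
        // true if the bytes form a valid UTF-8 sequence
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
        // true if the data starts with the 3-byte UTF-8 byte order mark,
        // which shifts every byte index by 3
        'has_bom'    => strncmp($bytes, "\xEF\xBB\xBF", 3) === 0,
    );
}

// Sample inputs (assumptions for illustration, not the asker's file):
$withBom = inspectEncoding("\xEF\xBB\xBF" . 'олы'); // UTF-8 text with a BOM
$latin1  = inspectEncoding("\xE9t\xE9");            // "été" in ISO-8859-1

var_dump($withBom, $latin1);
```

If valid_utf8 comes back false, the file was probably saved in another encoding and needs converting before any UTF-8-aware function will behave as expected.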

Anh Nhan Nguyen

I used this approach for dealing with UTF-8:

<?php
$string = 'ئاکام';//my name
mb_internal_encoding("UTF-8");
$mystring = mb_substr($string, 0, 1); // ئ
// without mb_internal_encoding the return was Ø
echo $mystring;
?>

I also saved all files with UTF-8 encoding.

0

In UTF-8, non-ASCII characters take more than 1 byte each, so to access one by byte index you would have to do:

echo $homepage[30] . $homepage[31];
> и

That assumes the character is exactly 2 bytes long, but it could be longer, so a more general solution would be:

function charAt($str, $pos, $encoding = "UTF-8")
{
    return mb_substr($str, $pos, 1, $encoding);
}
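
As a quick check, here is a sketch that runs both approaches on the asker's sample text. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; charAt is repeated only so the snippet runs on its own.

```php
<?php
// Same function as above, repeated so this snippet is self-contained.
function charAt($str, $pos, $encoding = "UTF-8")
{
    return mb_substr($str, $pos, 1, $encoding);
}

// Assumption: this literal is stored as UTF-8 bytes.
$homepage = 'advhasgdvgv олыолоываи ouhh';

// Byte-level access: the last Cyrillic letter occupies bytes 30 and 31.
$pair = $homepage[30] . $homepage[31];

// Character-level access: the same letter is character 21 (zero-based).
$char = charAt($homepage, 21);

echo $pair, "\n";
echo $char, "\n";
```

Both lines print the same letter, but only the second keeps working when a character needs 3 or 4 bytes.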
Supericy

PHP strings are byte sequences with no built-in notion of UTF-8, so $text[n] returns the n-th byte rather than the n-th character. A UTF-8 character takes 1 to 4 bytes, so you cannot tell from a character's position which byte offset it starts at, and a single $text[n] cannot return a character that needs multiple bytes.
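
The difference between bytes and characters can be seen in a small sketch using the asker's sample text. It assumes the source file is saved as UTF-8 and that the mbstring extension is available; the variable names are illustrative.

```php
<?php
// Assumption: this literal is stored as UTF-8 bytes.
$text = 'advhasgdvgv олыолоываи ouhh';

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// $text[14] is a single byte: the first half of the two-byte letter "л".
$singleByte   = $text[14];
$isValidAlone = mb_check_encoding($singleByte, 'UTF-8');

// Character 14 (zero-based) is the third Cyrillic letter.
$char = mb_substr($text, 14, 1, 'UTF-8');

echo "bytes: $byteLength, chars: $charLength\n";
echo "byte 14 valid on its own: " . var_export($isValidAlone, true) . "\n";
echo "char 14: $char\n";
```

A lone lead byte is not valid UTF-8, which is why a browser or terminal shows a placeholder or nothing at all when it is echoed by itself.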

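Depending on what you need, you can either convert the string to ISO 8859-1 using utf8_decode() (this only helps for characters that exist in Latin-1, so Cyrillic letters would come out as "?", and the function is deprecated as of PHP 8.2), or use a UTF-8-aware mechanism to iterate through the string from the beginning and extract the characters you need.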
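
One possible UTF-8-aware mechanism is to split the string into an array of characters first and index into that. The sketch below uses preg_split with the u modifier; the helper name utf8Chars is hypothetical, and the sample text assumes a UTF-8 source file.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8Chars($text)
{
    // The "u" modifier makes PCRE treat the subject as UTF-8, so the
    // empty pattern matches between characters rather than between bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false when $text is not valid UTF-8.
    return $chars === false ? array() : $chars;
}

$chars = utf8Chars('advhasgdvgv олыолоываи ouhh');

echo count($chars), "\n"; // number of characters, not bytes
echo $chars[12], "\n";    // first Cyrillic letter
```

On PHP 7.4 or later, mb_str_split($text, 1, 'UTF-8') should give the same array without a regular expression.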

Be aware that Linux and Windows versions of PHP might produce different output on certain conversions, such as mb_strtoupper(), and that not all regex functions support UTF-8.
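
For the regex point, a minimal sketch of how the u modifier changes what preg_match sees, assuming a UTF-8 source file. It also passes the encoding to mb_strtoupper explicitly so the result does not depend on the configured internal encoding; it does not reproduce any platform difference.

```php
<?php
$letter = 'и'; // two bytes in UTF-8

// Without "u", "." matches a single byte, so a two-byte letter fails ^.$
$withoutU = preg_match('/^.$/', $letter);

// With "u", "." matches a whole UTF-8 character.
$withU = preg_match('/^.$/u', $letter);

// Explicit encoding argument instead of relying on mb_internal_encoding().
$upper = mb_strtoupper($letter, 'UTF-8');

var_dump($withoutU, $withU, $upper);
```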

Jari Karppanen