5

I have 5 Greek characters in a string. After I use substr in PHP, the output is something like α�, when it should be αβγ. Any suggestions about encoding? I have tried

header ('Content-type: text/html; charset=utf-8');

with no result.

    <?php
    $string = "αβγδε";
    $thedoc = substr($string, 0, 3);
    echo $thedoc."<br/>";
    ?>
Chris Baker

4 Answers

17
$thedoc = mb_substr($string, 0, 3, 'UTF-8'); 

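You need to use mb_substr instead of substr, and you need to tell it that the string is UTF-8, either by passing 'UTF-8' as the fourth argument as above, or by setting PHP's internal encoding once with mb_internal_encoding('UTF-8').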
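As a minimal sketch of the two ways to specify the encoding, assuming the mbstring extension is loaded and PHP 7+ (the "\u{...}" escapes are used only so the example does not depend on how the file itself is saved):

```php
<?php
// Assumption: the mbstring extension is available.
$string = "\u{03B1}\u{03B2}\u{03B3}\u{03B4}\u{03B5}"; // αβγδε

// Option 1: pass the encoding explicitly on each call.
$explicit = mb_substr($string, 0, 3, 'UTF-8');

// Option 2: set the default once; mb_* calls without an
// encoding argument then use it.
mb_internal_encoding('UTF-8');
$implicit = mb_substr($string, 0, 3);

echo $explicit, "\n"; // αβγ
echo $implicit, "\n"; // αβγ
```

Passing the encoding explicitly is arguably the safer habit, since it does not depend on configuration set elsewhere in the application.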

The substr function is based on a simple character model where each character is one 8-bit byte. Using just substr($string, 0, 3), you get the first 3 bytes of the string. A Greek letter in UTF-8 encoding takes two bytes, so you get alpha (α) and “half of” beta, the first byte in its internal representation, which is not valid UTF-8 data and is thus displayed using the “replacement character” � (an indication of character level data error).
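The byte-level behaviour described above can be observed directly. This is a small sketch using only core PHP functions (PHP 7+ assumed for the "\u{...}" escapes). The variable names are illustrative.

```php
<?php
$string = "\u{03B1}\u{03B2}\u{03B3}\u{03B4}\u{03B5}"; // αβγδε, 5 characters

// strlen counts bytes, not characters: 5 letters x 2 bytes each.
$byteLength = strlen($string);

// substr cuts after 3 bytes: all of α (ce b1) plus the first byte of β (ce).
$cut = substr($string, 0, 3);
$cutHex = bin2hex($cut);

// With the /u modifier, preg_match returns false on invalid UTF-8,
// which makes it a convenient validity check.
$cutIsValidUtf8 = preg_match('//u', $cut) === 1;

echo $byteLength, "\n";    // 10
echo $cutHex, "\n";        // ceb1ce
var_dump($cutIsValidUtf8); // bool(false)
```

The dangling ce byte is what the browser renders as �.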

In practice, you could alternatively use substr($string, 0, 6), getting the first 6 bytes (3 characters), but this is an ugly way and relies on the text being specifically in letters that each take 2 bytes in UTF-8, so it would not work e.g. for mixed Latin and Greek text. It is much better to use an approach that can handle any UTF-8 data.
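To illustrate why counting bytes breaks on mixed text, here is a hedged sketch comparing the two approaches on a string that mixes one Latin letter with Greek letters (mbstring and PHP 7+ assumed; the sample string is made up for the example):

```php
<?php
// Assumption: the mbstring extension is available.
$mixed = "a\u{03B2}\u{03B3}\u{03B4}"; // aβγδ: 1 + 2 + 2 + 2 = 7 bytes

// Byte-based: 6 bytes covers "aβγ" (5 bytes) plus the first byte of δ.
$byBytes = substr($mixed, 0, 6);

// Character-based: exactly the first 3 characters.
$byChars = mb_substr($mixed, 0, 3, 'UTF-8');

// preg_match with /u returns false when the subject is not valid UTF-8.
$byBytesValid = preg_match('//u', $byBytes) === 1;
$byCharsValid = preg_match('//u', $byChars) === 1;

echo bin2hex($byBytes), "\n"; // 61ceb2ceb3ce
echo $byChars, "\n";          // aβγ
var_dump($byBytesValid);      // bool(false)
var_dump($byCharsValid);      // bool(true)
```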

Jukka K. Korpela
  • to be more specific, this problem comes with substr_replace in php http://stackoverflow.com/questions/11239597/substr-replace-encoding-in-php –  Jun 28 '12 at 08:27
  • what a great explanation especially in the second paragraph. thanks – Nikitas Jan 31 '14 at 14:38
3

Try iconv_substr, which also counts characters rather than bytes. The third argument is the length in characters, so use 3 to get αβγ:

iconv_substr($string, 0, 3, 'UTF-8');

1

As you're writing out the characters in your PHP code, be sure to check the encoding of the PHP file itself. For displaying the UTF-8 characters in the browser, you should also include the Content-Type meta tag in the <head>, like so:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Reinder Wit
0

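You can also try forcing the value to be a UTF-8 string. Note that utf8_encode assumes its input is ISO-8859-1, so this only helps if the string is not already UTF-8, and the function is deprecated as of PHP 8.2: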

echo utf8_encode( $thedoc ) . '<br />';