26

When I use substr() I get a strange character at the end

$articleText = substr($articleText,0,500);

I have an output of 500 chars and � <--

How can I fix this? Is it an encoding problem? My language is Greek.

alimack
Stoikidis

7 Answers7

61

substr counts bytes, not characters.

Greek text probably means you are using a multi-byte encoding such as UTF-8, where a byte offset can land in the middle of a character.

Try mb_substr instead: the mb_* functions were created specifically for multi-byte encodings.
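To make the byte/character difference concrete, here is a minimal sketch. It assumes the mbstring extension is enabled and that the string is UTF-8; the sample text is made up for illustration.

```php
<?php
// Assumption: mbstring is available and the source file is saved as UTF-8.
$text = "αβγδεζ"; // six Greek letters, two bytes each in UTF-8

var_dump(strlen($text));              // 12 (bytes)
var_dump(mb_strlen($text, 'UTF-8'));  // 6  (characters)

// substr() cuts after 5 bytes, which splits the third letter in half.
$bytes = substr($text, 0, 5);

// mb_substr() cuts after 5 characters and keeps every letter intact.
$chars = mb_substr($text, 0, 5, 'UTF-8');

var_dump(mb_check_encoding($bytes, 'UTF-8')); // false: ends in a partial character
var_dump(mb_check_encoding($chars, 'UTF-8')); // true
```

The dangling half-character left by substr() is what shows up as � in the browser.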

Pascal MARTIN
  • 4
    Learning more and more every single day... Thank you stackoverflow ! – Boris Delormas Dec 19 '11 at 10:07
  • 1
    Thank you very much. But as for me the main thing is to add `mb_internal_encoding("UTF-8");` before using `mb_*` functions. Without adding it I still see squares. – ivkremer Dec 27 '13 at 15:46
  • @Kremchik You won't see squares, if you use `mb_substr($short, 0, 75, 'utf-8')`. Then you don't need to use `mb_internal_encoding` before `mb_substr`. – trejder Jun 23 '14 at 12:39
20

Use mb_substr instead. It handles multi-byte encodings, whereas substr only works on single-byte strings:

$articleText = mb_substr($articleText,0,500,'UTF-8');
hakre
Uğur Özpınar
  • 2
    "UTF-8" part was important for me - don't forget it peeps! –  Jul 10 '13 at 19:47
  • 1
    "UTF-8" as optional parameter worked for me. Keep in mind that you might also want to use mb_strlen() if you are using the string length to determine if it must be cut. – Kent Munthe Caspersen Jul 15 '13 at 11:20
  • 2
    An alternative is to use `mb_internal_encoding('utf-8')` before any `mb_*` command. – trejder Jun 23 '14 at 12:40
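The mb_strlen() point from the comments can be sketched as a small helper. The function name `truncate_utf8` is hypothetical, and the sketch assumes PHP 7+ with mbstring enabled and UTF-8 input.

```php
<?php
// Hypothetical helper (not part of PHP): cut a UTF-8 string to at most
// $limit characters, measuring with mb_strlen() rather than strlen().
function truncate_utf8(string $text, int $limit): string
{
    // Compare character counts, not byte counts, before deciding to cut.
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    return mb_substr($text, 0, $limit, 'UTF-8');
}

$articleText = truncate_utf8("αβγδεζ", 4);
var_dump($articleText);
```

Using strlen() for the check would treat a 300-character Greek string as roughly 600 long and cut text that already fits.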
6

Looks like you're slicing a Unicode character in half there. Use mb_substr instead for Unicode-safe string slicing.

deceze
  • 1
    ...either by calling `mb_internal_encoding('utf-8')` beforehand or by passing `'utf-8'` as the fourth parameter of `mb_substr`. The docs say this parameter is optional and that the internal character encoding is used when it is omitted, but the thing is (explained elsewhere in the PHP docs) that PHP's "internal encoding" is nearly always something other than your page encoding. So for slicing a UTF-8 string, either the fourth parameter or a call to `mb_internal_encoding('utf-8')` becomes required. – trejder Jun 23 '14 at 12:42
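The two options from the comments give the same result, as this short sketch suggests. It assumes mbstring is enabled and the input is UTF-8; the sample string is illustrative only.

```php
<?php
$short = "αβγδεζ"; // sample UTF-8 text (assumption)

// Option 1: name the encoding on each call.
$explicit = mb_substr($short, 0, 3, 'UTF-8');

// Option 2: set the internal encoding once, then omit the fourth argument.
mb_internal_encoding('UTF-8');
$implicit = mb_substr($short, 0, 3);

var_dump($explicit === $implicit); // true
```

Passing the encoding explicitly keeps the call independent of whatever the server's internal encoding happens to be.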
1

Use this function; it worked for me:

function substr_unicode($str, $s, $l = null) {
    // Split into UTF-8 characters (the "u" modifier), then slice by character index.
    return join("", array_slice(
        preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY), $s, $l));
}

Credits: http://php.net/manual/en/function.mb-substr.php#107698
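A quick usage sketch of the function above, compared against mb_substr. It assumes valid UTF-8 input and, for the comparison only, that mbstring is enabled; the sample string is made up.

```php
<?php
function substr_unicode($str, $s, $l = null) {
    return join("", array_slice(
        preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY), $s, $l));
}

$text = "αβγδεζ"; // sample UTF-8 text (assumption)

$slice = substr_unicode($text, 1, 3); // characters 1..3
$tail  = substr_unicode($text, 4);    // from character 4 to the end

var_dump($slice === mb_substr($text, 1, 3, 'UTF-8')); // true
var_dump($tail);
```

The function itself needs only PCRE, so it can serve as a fallback where mbstring is not installed.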

Kerem
Moussawi7
0

mb_substr() also works well for removing strange trailing line breaks, which I was having trouble with after parsing HTML code. The problem was NOT handled by:

 trim() 

or:

 var_dump(preg_match('/^\n|\n$/', $variable));

or:

str_replace (array('\r\n', '\n', '\r'), ' ', $text)

None of these catch it.

Dr Nick Engerer
0

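Alternative solution for UTF-8 encoded strings: this converts the UTF-8 text to single-byte ISO-8859-1 before cutting the substring. Characters that ISO-8859-1 cannot represent, Greek letters included, are replaced with "?".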

$articleText = substr(utf8_decode($articleText),0,500);

To get the articleText string back to UTF-8, an extra operation will be needed:

$articleText = utf8_encode( substr(utf8_decode($articleText),0,500) );
Kristoffer Bohmann
0

You are cutting a Unicode character in half. Instead of substr(), use mb_substr() in PHP.

substr()

substr ( string $string , int $start [, int $length ] )

mb_substr()

mb_substr ( string $str , int $start [, int $length [, string $encoding ]] )

For more information on substr(), see the PHP manual.

GowriShankar