If I run this code:

<?php
$string = 'My string &lsquo;to parse&rsquo;';
$string_decoded = html_entity_decode($string, ENT_QUOTES, 'utf-8');
$string_array = str_split($string_decoded);
var_dump($string_array);
?>

I get this result:

array (size=24)
  0 => string 'M' (length=1)
  1 => string 'y' (length=1)
  2 => string ' ' (length=1)
  3 => string 's' (length=1)
  4 => string 't' (length=1)
  5 => string 'r' (length=1)
  6 => string 'i' (length=1)
  7 => string 'n' (length=1)
  8 => string 'g' (length=1)
  9 => string ' ' (length=1)
  10 => string '�' (length=1)
  11 => string '�' (length=1)
  12 => string '�' (length=1)
  13 => string 't' (length=1)
  14 => string 'o' (length=1)
  15 => string ' ' (length=1)
  16 => string 'p' (length=1)
  17 => string 'a' (length=1)
  18 => string 'r' (length=1)
  19 => string 's' (length=1)
  20 => string 'e' (length=1)
  21 => string '�' (length=1)
  22 => string '�' (length=1)
  23 => string '�' (length=1)

As you can see, instead of the decoded left and right single quotes, I'm getting three unreadable characters for each quote.

I noticed that this happens with some entities but not others. A few that show the issue are &lsquo; &rdquo; &copy;. Some that don't are &amp; &gt;.

I tried different charsets but couldn't find one that would work for all.

What am I doing wrong? Is there a way to make it work for all entities? Or at least all the "common" ones?

Thanks.

Dentra Andres
  • From the [PHP Docs](http://php.net/manual/en/function.str-split.php): `Note: str_split() will split into __bytes__, rather than characters when dealing with a multi-byte encoded string.` (my emphasis).... smart quotes like `‘` are being translated to a multi-byte character – Mark Baker Dec 04 '15 at 23:24
  • You should use `mb_split`: http://php.net/manual/en/function.mb-split.php for utf-8 – Zefiryn Dec 04 '15 at 23:25
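
To make the point in the comments concrete, here is a minimal sketch that compares the byte length and the character length of a decoded entity. It assumes the mbstring extension is loaded (needed for `mb_strlen`); the variable names are illustrative only.

```php
<?php
// Decode a non-ASCII entity, then compare bytes with characters.
$quote = html_entity_decode('&lsquo;', ENT_QUOTES, 'UTF-8');

var_dump(strlen($quote));              // int(3): number of bytes
var_dump(mb_strlen($quote, 'UTF-8'));  // int(1): number of characters
var_dump(bin2hex($quote));             // string(6) "e28098"

// An ASCII-range entity decodes to a single byte, which is why
// str_split() appears to work for &amp; and &gt;.
$amp = html_entity_decode('&amp;', ENT_QUOTES, 'UTF-8');
var_dump(strlen($amp));                // int(1)
```

Each of the three `�` entries in the question's output is one of those three bytes shown on its own, which is not valid UTF-8 in isolation.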

1 Answer


This should do well:

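// mb_str_split() is built in since PHP 7.4, so only define this fallback on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string) {
        // The /u modifier makes the regex split between UTF-8 characters, not bytes.
        return preg_split('/(?<!^)(?!$)/u', $string);
    }
}
$string = 'My string &lsquo;to parse&rsquo;';
// No utf8_encode() needed: the source string is plain ASCII until the entities are decoded.
$string_decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
$string_array = mb_str_split($string_decoded);
var_dump($string_array);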

As mentioned in the comments, str_split() splits by bytes, so the decoded string has to be split with a multi-byte-aware approach instead: mb_split or a regex with the /u modifier.

Proof: https://3v4l.org/3FRmG
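
As a variation on the same idea, here is a hedged sketch of a helper that prefers the built-in `mb_str_split()` (available from PHP 7.4 with mbstring) and falls back to `preg_split()` otherwise. The name `split_utf8_chars` is hypothetical and was chosen only to avoid clashing with the built-in function.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters, not bytes.
function split_utf8_chars(string $s): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($s, 1, 'UTF-8');
    }
    // Fallback: an empty pattern with /u matches between code points.
    $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

$decoded = html_entity_decode('My string &lsquo;to parse&rsquo;', ENT_QUOTES, 'UTF-8');
$chars = split_utf8_chars($decoded);

var_dump(count($chars)); // int(20)
var_dump($chars[10]);    // string(3) "‘"
```

The array now has 20 elements, one per character, and each quote stays intact as a single 3-byte string.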

jankal