
Using either htmlspecialchars or htmlentities produces blank output for text containing characters such as a trademark (™) symbol or even a single quote ('). That makes the output useless, but printing the data without escaping HTML characters shows � in place of both characters. Why is this occurring?

Here is the code that is causing the problem:

<p>
<?php 
    echo nl2br(htmlspecialchars($aboutarray[0]['about_us'], ENT_COMPAT, "UTF-8")); 
?>
</p>
diestl
JimmyBanks
  • Sounds like a charset issue. Are you sure that your data is UTF-8-encoded? – Emil Vikström Jun 26 '12 at 15:13
  • I may be misunderstanding your problem, but I tried this on ideone.com and it seems to work fine: http://ideone.com/P298n – Eric H Jun 26 '12 at 15:14
  • @EmilVikström How do I go about making sure of this? – JimmyBanks Jun 26 '12 at 15:15
  • @EricH yeah it works fine on one of my websites, but for the other with identical code it outputs incorrectly. – JimmyBanks Jun 26 '12 at 15:16
  • @JimmyBanks You might could try `utf8_encode()`: http://php.net/manual/en/function.utf8-encode.php – Eric H Jun 26 '12 at 15:17
  • Where is `$aboutarray[0]['about_us']` coming from? – deceze Jun 26 '12 at 15:19
  • This will give you the byte sequence of the string: `for ($i = 0; $i < strlen($string); $i++) printf('%d ', ord($string[$i]));` – Emil Vikström Jun 26 '12 at 15:20
  • @EricH Using utf8_encode worked. Now I'm confused about why this is necessary on one site when the text outputs properly from the get-go on the other. – JimmyBanks Jun 26 '12 at 15:21
  • @deceze I didn't include the query for the array, but the value is text and I have confirmed the output without using `htmlspecialchars` or `htmlentities` – JimmyBanks Jun 26 '12 at 15:22
  • The collation in the database is utf8_general_ci – JimmyBanks Jun 26 '12 at 15:23
  • If `utf8_encode` worked, that means the data was actually encoded in Latin-1. You may want to read this: http://kunststube.net/frontback/ – deceze Jun 26 '12 at 15:23
  • What's the connection charset (set during connection to the database)? – Emil Vikström Jun 26 '12 at 15:24
  • The data I took was copy-pasted from an old site (migrating the site, not stealing anything). Could that be why the text from the site was encoded in Latin-1? – JimmyBanks Jun 26 '12 at 15:29
  • @EmilVikström In the header of the site I have the meta tag ``. As for the database connection, I am using ADODB, which doesn't have any issues on the first site. I haven't specified an encoding as far as I know. – JimmyBanks Jun 26 '12 at 15:31
  • Upon further inspection, utf8_encode is just removing the trademark symbol – JimmyBanks Jun 26 '12 at 15:36
  • @JimmyBanks Check out `mb_detect_encoding()`: http://php.net/manual/en/function.mb-detect-encoding.php or `mb_check_encoding()`: http://www.php.net/manual/en/function.mb-check-encoding.php. Those may be of assistance in tracking down the issue. – Eric H Jun 26 '12 at 18:12
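
The diagnostic steps suggested in the comments above can be combined into a small sketch. It assumes the mbstring extension is loaded, and the sample string is hypothetical (the bytes `T`, `M`, and 0x99), standing in for the real `$aboutarray[0]['about_us']` value:

```php
<?php
// Hypothetical sample value; substitute the real database field here.
$string = "TM\x99";

// Same idea as the printf/ord loop from the comments: list every byte as a decimal.
$bytes = array_map('ord', str_split($string));
echo implode(' ', $bytes), "\n";

// mb_check_encoding() reports whether the bytes form valid UTF-8.
$looksUtf8 = mb_check_encoding($string, 'UTF-8');
var_dump($looksUtf8);
```

A lone byte in the range 128 to 255, such as 153 here, is never valid UTF-8 on its own, which would point to a single-byte encoding such as Latin-1 or Windows-1252.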

1 Answer


That string is not valid UTF-8. It could be in another encoding, such as UTF-16, or it may just contain binary junk that doesn't correspond to any encoding.

The bottom line is that, since you specified "UTF-8" as the encoding parameter of htmlspecialchars(), it returns an empty string if the input is not valid UTF-8. The PHP manual documents this behavior.
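
As a minimal sketch of that behavior, assuming PHP 5.4 or later with the mbstring extension: the sample string is hypothetical and uses byte 0x99, which is the trademark sign in Windows-1252 but is not a valid UTF-8 sequence.

```php
<?php
// Hypothetical sample: "Acme" followed by the Windows-1252 trademark byte.
$about = "Acme\x99 isn't valid UTF-8";

// The input fails UTF-8 validation ...
$isValid = mb_check_encoding($about, 'UTF-8');

// ... so htmlspecialchars() with only ENT_COMPAT gives back an empty string.
$escaped = htmlspecialchars($about, ENT_COMPAT, 'UTF-8');

var_dump($isValid, $escaped);
```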

A simple fix is to add the ENT_SUBSTITUTE or ENT_IGNORE flag. Change:

htmlspecialchars($aboutarray[0]['about_us'], ENT_COMPAT, "UTF-8")

To:

htmlspecialchars($aboutarray[0]['about_us'], ENT_COMPAT|ENT_SUBSTITUTE, "UTF-8")

Or:

htmlspecialchars($aboutarray[0]['about_us'], ENT_COMPAT|ENT_IGNORE, "UTF-8")

Note: ENT_IGNORE silently removes the invalid bytes, which can cause a security issue. It's better to understand what your string actually contains and how it is encoded. Correct the source of the problem rather than rely on the quick ENT_IGNORE fix.
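
To make the difference between the two flags concrete, here is a short sketch using a hypothetical sample string with one invalid byte (0x99). ENT_SUBSTITUTE should replace that byte with the Unicode replacement character U+FFFD, while ENT_IGNORE drops it:

```php
<?php
// Hypothetical sample containing one byte that is not valid UTF-8.
$about = "Acme\x99 & Co";

// Invalid byte becomes U+FFFD (bytes EF BF BD); the ampersand is still escaped.
$substituted = htmlspecialchars($about, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

// Invalid byte is discarded without any trace.
$ignored = htmlspecialchars($about, ENT_COMPAT | ENT_IGNORE, 'UTF-8');

echo bin2hex($substituted), "\n";
echo $ignored, "\n";
```

With ENT_SUBSTITUTE the damaged spot stays visible in the output, which makes the underlying encoding problem easier to notice than with ENT_IGNORE.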

You should ask yourself why your string is not encoded in UTF-8... it should be, but it's not.
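
One way to correct the data itself is to convert it to UTF-8 before escaping. The sketch below assumes the mbstring extension and assumes the text is really Windows-1252, a guess based on the trademark symbol mentioned in the comments rather than something the question confirms. The helper name `toUtf8` is hypothetical.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from an
// assumed source encoding (Windows-1252 here is an assumption, not a fact).
function toUtf8(string $text, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$about = "Acme\x99";      // hypothetical sample with a Windows-1252 trademark byte
$fixed = toUtf8($about);  // 0x99 becomes U+2122, encoded as E2 84 A2

echo htmlspecialchars($fixed, ENT_COMPAT, 'UTF-8'), "\n";
```

This may also explain why `utf8_encode()` appeared to remove the trademark symbol: it treats input as ISO-8859-1, where byte 0x99 is an invisible control character, not ™. A longer-term fix is usually to make the stored data and the database connection charset consistently UTF-8.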

I happen to have just encountered this problem as well; you can read details on why an empty string is being returned here.

Lakey