
If we check out the documentation of the htmlspecialchars() function in PHP, we see that it has an $encoding parameter to specify the encoding of the input string.

Conversely, I would expect its inverse, htmlspecialchars_decode(), to have an $encoding parameter as well. However, this is NOT the case.

I want to know exactly why this is the case. There has to be some reason for not including an $encoding parameter in htmlspecialchars_decode().

Surprisingly, there is an $encoding parameter in html_entity_decode(), so what's the point of including it in that function?

coderboy
  • Very interesting question. I can only guess: I think it's because you can set a default encoding with `ini_set('default_charset', 'UTF-8');`, and it's somewhat expected that when you "internalize" something, you want it in the format defined as the default... Just my guess, though. – Mruf Apr 26 '23 at 13:38
  • This is where PHP has its ambiguities :) – coderboy Apr 26 '23 at 13:39

1 Answer


I'd have to guess here slightly, but… htmlspecialchars_decode only decodes a small handful of characters, all of which are ASCII characters. So there's no need to specify the target encoding you want to decode these characters to, as they're encoded identically in all ASCII-compatible encodings. Now what if you wanted to decode to a non-ASCII-compatible encoding? That is probably almost never the case, and you can simply do some encoding conversion before and/or afterwards if you really needed that.
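
A minimal sketch of that point, assuming a PHP 8 CLI. The sample strings are made up for illustration: the same text with "é" stored once as UTF-8 and once as ISO-8859-1.

```php
<?php
// Hypothetical sample text, with é encoded two different ways.
$utf8   = "caf\xC3\xA9 &lt;b&gt; &amp; &quot;x&quot;"; // é as UTF-8 (2 bytes)
$latin1 = "caf\xE9 &lt;b&gt; &amp; &quot;x&quot;";     // é as ISO-8859-1 (1 byte)

$decodedUtf8   = htmlspecialchars_decode($utf8, ENT_QUOTES);
$decodedLatin1 = htmlspecialchars_decode($latin1, ENT_QUOTES);

// Only the ASCII entities change; the bytes for é pass through untouched.
var_dump($decodedUtf8 === "caf\xC3\xA9 <b> & \"x\""); // bool(true)
var_dump($decodedLatin1 === "caf\xE9 <b> & \"x\"");   // bool(true)
```

In both cases the result is correct without the function ever being told which encoding the input uses.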

PHP has always assumed ASCII for the things that matter to it and arbitrary bytes for anything else that doesn't matter to it, so this function has never received any unified encoding support, just as a lot of other functions haven't either.

The functions htmlspecialchars and html_entity_decode have received this treatment at some point, as the cases where the encoding does matter are probably encountered more often with them. In the case of html_entity_decode, it decodes a wider range of characters and it does matter what encoding you decode those to.
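
To illustrate why the target encoding matters for html_entity_decode, here is a small sketch (the `&eacute;` entity is just an arbitrary example) that decodes the same entity to two encodings and prints the resulting bytes:

```php
<?php
$entity = '&eacute;';

// Same entity, two different target encodings.
$asUtf8   = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
$asLatin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($asUtf8), PHP_EOL;   // c3a9 (two bytes in UTF-8)
echo bin2hex($asLatin1), PHP_EOL; // e9   (one byte in ISO-8859-1)
```

The decoded character is the same, but the bytes that end up in the string differ, so the function has to know which encoding you want.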

htmlspecialchars appears to need to know the encoding to properly preserve the string's contents. I don't really understand why, as it would just need to look for certain ASCII bytes to replace, but not passing the correct encoding will garble your non-ASCII text.
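
A hedged sketch of that behaviour, with a made-up ISO-8859-1 input string. It relies on the documented rule that htmlspecialchars returns an empty string for input that is invalid in the given encoding unless ENT_IGNORE or ENT_SUBSTITUTE is set; the flags are passed explicitly here so the result does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical input: é as a single ISO-8859-1 byte, which is not valid UTF-8.
$latin1 = "caf\xE9 <b>";

// Correct encoding: the é byte is preserved, only < and > are escaped.
$right = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

// Wrong encoding, no substitution flag: the whole result is an empty string.
$wrong = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// Wrong encoding with ENT_SUBSTITUTE: the invalid byte becomes U+FFFD.
$subst = htmlspecialchars($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($right === "caf\xE9 &lt;b&gt;");         // bool(true)
var_dump($wrong === "");                          // bool(true)
var_dump($subst === "caf\xEF\xBF\xBD &lt;b&gt;"); // bool(true)
```

So the function does at least validate the input against the encoding it is given, which would explain why the parameter exists even though the characters it escapes are all ASCII.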

deceze
  • I can relate to what you are saying, but the problem is that `htmlspecialchars()` does have it, and there seems to be no reason for it to have the parameter. – coderboy Apr 26 '23 at 13:47
  • `htmlspecialchars` does seem to do… something… with strings besides just looking for the ASCII bytes of HTML special characters and encoding them. For example, `htmlspecialchars(iconv("UTF-8", "SJIS", "<漢字 &>"))` garbles the input when not passing "SJIS" as the `htmlspecialchars` `$encoding` parameter. I'm not entirely sure what it does, but here we are. – deceze Apr 26 '23 at 14:02
  • You're right. I also noticed that it does more than just look for ASCII bytes. But then I expect `htmlspecialchars_decode()` to do the same, what do you think? – coderboy Apr 26 '23 at 14:34
  • Is there any way to ask this from the developers of PHP, maybe on GitHub? – coderboy Apr 26 '23 at 14:37
  • I imagine digging through the C implementation to see what it does would be a good first step. – deceze Apr 26 '23 at 14:52
  • That often doesn't help :) I've tried it. – coderboy Apr 26 '23 at 15:06
  • Upon looking into the source code, I've found that `htmlspecialchars_decode()` uses ISO-8859-1 by default, for performance. – coderboy Apr 26 '23 at 15:59
  • But still, it's difficult to understand why a charset is used for `htmlspecialchars()`. Can you help me figure it out? – coderboy Apr 26 '23 at 16:03