1

I have a forum-style text box and I would like to sanitize the user input to prevent XSS and code injection. I have seen htmlentities used, but others say that the &, #, %, and : characters need to be encoded as well, and the more I look, the more potentially dangerous characters pop up. Whitelisting is problematic because there is a lot of valid text beyond a-zA-Z0-9. I have come up with the code below. Will it stop attacks and be secure? Is there any reason not to use it, or is there a better way?

function replaceHTML ($match) {
    return "&#" . ord ($match[0]) . ";";
}

$clean = preg_replace_callback ( "/[^ a-zA-Z0-9]/", "replaceHTML", $userInput );
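One reason to be cautious with this approach shows up with non-ASCII text. The sketch below assumes the input is UTF-8 and wraps the question's call in a helper named encodeNonAlnum (a name made up for this example). It is only an illustration of how the pattern behaves on a multibyte character, not a complete analysis.

```php
<?php
// The question's callback, reproduced so the example runs on its own.
function replaceHTML($match) {
    return "&#" . ord($match[0]) . ";";
}

// Hypothetical wrapper around the question's preg_replace_callback call.
function encodeNonAlnum($userInput) {
    return preg_replace_callback("/[^ a-zA-Z0-9]/", "replaceHTML", $userInput);
}

// "é" in UTF-8 is the two bytes 0xC3 0xA9.
$eAcute = "\xC3\xA9";

// Without the /u modifier the pattern matches byte by byte, so each byte
// becomes its own numeric entity and a browser would show two characters.
$custom = encodeNonAlnum($eAcute);

// htmlspecialchars() leaves valid UTF-8 untouched when told the charset.
$builtin = htmlspecialchars($eAcute, ENT_QUOTES, 'UTF-8');

echo $custom, "\n";   // &#195;&#169;
echo $builtin, "\n";  // the original two bytes, unchanged
```

For plain ASCII input the callback behaves as intended; the issue only appears once users type accented or non-Latin characters.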

EDIT: I could of course be wrong, but it is my understanding that htmlentities only replaces & < > " (and ' if ENT_QUOTES is turned on). That is probably enough to stop most attacks (and frankly probably more than enough for my low-traffic site). In my obsessive attention to detail, however, I dug further. A book I have warns to also encode # and % for "shutting down hex attacks". Two websites I found warned against allowing : and --. It's all rather confusing to me, and it led me to explore converting all non-alphanumeric characters. If htmlentities does this already then great, but it does not seem to. Here are the results of a test I ran, copied from View Source in Firefox.

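original (random characters to test): <b>5:</b>gjla<hi>#''*&$!j-l:4

preg_replace_callback: &#60;b&#62;5&#58;&#60;&#47;b&#62;gjla&#60;hi&#62;&#35;&#39;&#39;&#42;&#38;&#36;&#33;j&#45;l&#58;4

htmlentities (w/ ENT_QUOTES): &lt;b&gt;5:&lt;/b&gt;gjla&lt;hi&gt;#&#039;&#039;*&amp;$!j-l:4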

htmlentities does not appear to encode those other characters, such as the colon. Sorry for the wall of text. Is this just me being paranoid?
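The comparison described above can be reproduced with a short script. This is a sketch that assumes the test string was the one shown in the post with its tags intact; encodeNonAlnum is a hypothetical wrapper around the question's call, and the UTF-8 charset argument is also an assumption.

```php
<?php
// Same callback as in the question.
function replaceHTML($match) {
    return "&#" . ord($match[0]) . ";";
}

// Hypothetical wrapper around the question's preg_replace_callback call.
function encodeNonAlnum($userInput) {
    return preg_replace_callback("/[^ a-zA-Z0-9]/", "replaceHTML", $userInput);
}

// The dollar sign is escaped so PHP treats it as a literal character.
$userInput = "<b>5:</b>gjla<hi>#''*&\$!j-l:4";

$viaCallback = encodeNonAlnum($userInput);
$viaEntities = htmlentities($userInput, ENT_QUOTES, 'UTF-8');

// Raw output, as it would appear in View Source.
echo $viaCallback, "\n";
echo $viaEntities, "\n";

// Both encodings decode back to the same text, so a browser displays them
// identically and neither one lets the <b> tag through as markup.
var_dump(html_entity_decode($viaCallback, ENT_QUOTES, 'UTF-8') === $userInput);
var_dump(html_entity_decode($viaEntities, ENT_QUOTES, 'UTF-8') === $userInput);
```

The colon, #, and similar characters are left alone by htmlentities because they have no special meaning in HTML text or in quoted attribute values.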

EDIT #2: ___________

  • You only need to escape quotes and angle brackets. The other special characters just need escaping in case they encounter unquoted html attributes. – mario Oct 22 '11 at 21:37
  • Stop being paranoid, `htmlentities()` (tries to) replace all characters that have an HTML entity representation, it is enough to stop **all** XSS attacks, the same goes for `htmlspecialchars()` as long as you use `ENT_QUOTES`. – Alix Axel Oct 22 '11 at 23:45
  • Thank you, I just needed to hear it :). It's hard being self-taught and paranoid, with a lot of conflicting information all over. – user1008960 Oct 23 '11 at 02:56
  • @user1008960: Please pick an answer and accept it if this is solved. – Alix Axel Oct 23 '11 at 21:08
  • possible duplicate of [What's the best method for sanitizing user input with PHP?](http://stackoverflow.com/questions/129677/whats-the-best-method-for-sanitizing-user-input-with-php) – Alix Axel Oct 23 '11 at 21:09
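The point in the comments about unquoted attributes can be illustrated with a small sketch. The payload and the input element are made up for this example; the idea is only that escaping quotes and angle brackets does nothing when the attribute value is not wrapped in quotes.

```php
<?php
// Hypothetical attacker-supplied value: no quotes or angle brackets needed.
$userInput = "x onmouseover=alert(1)";

// Nothing in the payload is touched by htmlspecialchars().
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

// Unquoted attribute: the space ends the value, so onmouseover becomes
// a separate attribute even though the input was "escaped".
$unsafe = "<input value=" . $escaped . ">";

// Quoted attribute: the whole string stays inside the value.
$safe = '<input value="' . $escaped . '">';

echo $unsafe, "\n";
echo $safe, "\n";
```

Always quoting attribute values in templates keeps the standard escaping functions sufficient for this case.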

3 Answers

1

That is exactly what htmlentities does already:

http://codepad.viper-7.com/NDZMa3

It will convert (spaced to prevent stackoverflow double encoding):
"& # amp ;"
to
"& # amp; # amp ;"

evan
  • It is my understanding that htmlentities only changes < > & " (and ' if ENT_QUOTES is turned on), not all non-alphanumeric characters; see the output in the edited question. Is this correct? – user1008960 Oct 22 '11 at 22:47
  • htmlspecialchars() does only that set of translations, htmlentities will do full entity translation. Really, either one of these will do the job. You don't need anything else or someone would have fixed them or created a new function that does exactly what these should be doing - which is to help prevent injection. – evan Oct 23 '11 at 16:49
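The difference between the two functions discussed here can be seen in a short sketch. The sample string is made up for illustration and the UTF-8 charset argument is an assumption about the site's encoding.

```php
<?php
// "café <b> & ©" written with explicit UTF-8 byte escapes.
$text = "caf\xC3\xA9 <b> & \xC2\xA9";

// Only & < > " ' are translated.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character with a named HTML entity is translated as well.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // é and © stay as raw UTF-8 bytes
echo $entities, "\n";  // caf&eacute; &lt;b&gt; &amp; &copy;

// Neither function touches : # or %, which have no named entity.
var_dump(htmlentities("a:b#c%d", ENT_QUOTES, 'UTF-8') === "a:b#c%d");
```

Both calls neutralize the tag in the same way; htmlentities only adds named entities for characters such as é and ©.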
1

All you need to do to stop XSS attacks is run user input through htmlspecialchars() with ENT_QUOTES when you output it into HTML.

Alix Axel
  • htmlentities is the same as htmlspecialchars: "This function [htmlentities()] is identical to htmlspecialchars() in all ways, except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities." – evan Oct 22 '11 at 21:42
  • @evan: They aren't the same, `htmlspecialchars()` does a lot less but still manages to defeat all XSS attacks. – Alix Axel Oct 22 '11 at 23:41
  • Well, that quote is directly from http://php.net/manual/en/function.htmlentities.php – evan Oct 23 '11 at 16:37
  • From htmlspecialchars: "Note that this function does not translate anything beyond what is listed above. For full entity translation, see htmlentities()." – evan Oct 23 '11 at 16:47
  • @evan: That's why I said it does a lot less. =) Still, it's enough to stop all XSS vectors. – Alix Axel Oct 23 '11 at 21:03
  • @evan: Unless of course, you're injecting the encoded string directly into certain tag attributes, but if that's the case `htmlentities()` won't save you either... – Alix Axel Oct 23 '11 at 21:06
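The last caveat, about injecting the encoded string into certain tag attributes, is where the colon from the question becomes relevant. The sketch below uses a hypothetical helper named safeHref (not a PHP built-in) and assumes that only http and https links should be allowed; it is one possible way to handle link attributes, not the only one.

```php
<?php
// Hypothetical helper: accept only http(s) URLs for use in an href attribute.
function safeHref($url) {
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // parse_url() yields null or false when there is no usable scheme.
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return '#';
    }
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

$payload = "javascript:alert(1)";

// Escaping alone leaves the payload unchanged, so it would still run
// if placed inside href="...".
echo htmlspecialchars($payload, ENT_QUOTES, 'UTF-8'), "\n";

// Checking the scheme first rejects it.
echo safeHref($payload), "\n";                        // #
echo safeHref("http://example.com/?a=1&b=2"), "\n";   // http://example.com/?a=1&amp;b=2
```

Encoding the colon everywhere, as the question's regex does, is not needed for ordinary text output; the check only matters where a value is interpreted as a URL.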
0

The space ' ' in your regex can be replaced with \s, adding the i modifier at the end makes it case-insensitive, and you don't need to translate the characters to entities manually; that can be done by using htmlentities as the callback:

$clean = preg_replace_callback('/[^a-z0-9\s]/i', 'htmlentities', $userInput);
  • This answer is crap. You've just posted meaningless code with not even an attempt at an explanation. – Bojangles Oct 22 '11 at 21:56
  • I've made minimal changes, and everything I changed was already explained by other commenters; nevertheless, thanks for your opinion –  Oct 22 '11 at 22:07
  • It's not working, because the callback gets passed an array, yet htmlentities expects a string parameter. And it still wouldn't escape *all* input values, just what a regular htmlentities() call might convert. – mario Oct 22 '11 at 22:09
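A version of this answer's idea that addresses the array problem raised in the last comment might look like the sketch below. The wrapper name encodeNonAlnum is made up, and the added u modifier assumes the input is valid UTF-8 (with that modifier, preg_replace_callback returns null on invalid input).

```php
<?php
// Hypothetical wrapper: apply htmlentities() to each non-alphanumeric match.
function encodeNonAlnum($userInput) {
    return preg_replace_callback(
        '/[^a-z0-9\s]/iu',
        function ($match) {
            // $match is an array; element 0 holds the matched character.
            return htmlentities($match[0], ENT_QUOTES, 'UTF-8');
        },
        $userInput
    );
}

echo encodeNonAlnum("<b>5:</b>"), "\n";   // &lt;b&gt;5:&lt;/b&gt;

// As the comment points out, the result is no different from calling
// htmlentities() on the whole string.
var_dump(encodeNonAlnum("<b>5:</b>") === htmlentities("<b>5:</b>", ENT_QUOTES, 'UTF-8'));
```

Since the output matches a single htmlentities() call on this input, the regex step adds nothing here.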