I have a forum style text box and I would like to sanitize the user input to stop potential xss and code insertion. I have seen htmlentities used, but then others have said that &,#,%,: characters need to be encoded as well, and it seems the more I look, the more potentially dangerous characters pop up. Whitelisting is problematic as there are many valid text options beyond ^a-zA-z0-9. I have come up with this code. Will it work to stop attacks and be secure? Is there any reason not to use it, or a better way?
function replaceHTML ($match) {
return "&#" . ord ($match[0]) . ";";
}
$clean = preg_replace_callback ( "/[^ a-zA-Z0-9]/", "replaceHTML", $userInput );
EDIT:_____________________________ I could of course be wrong, but it is my understanding that htmlentities only replaces & < > " (and ' if ENT_QUOTES is turned on). This is probably enough to stop most attacks (and frankly probably more than enough for my low traffic site). In my obsessive attention to detail, however, I dug further. A book I have warns to also encode # and % for "shutting down hex attacks". Two websites I found warned against allowing : and --. Its all rather confusing to me, and led me to explore converting all non-alphanumeric characters. If htmlentities does this already then great, but it does not seem to. Here are results from code I ran I copied after clicking view source in firefox.
original (random characters to test): 5:gjla#''*&$!j-l:4
preg_replace_callback: <b>5:</b>gjla<hi>#''*&$!j-l:4
htmlentities (w/ ENT_QUOTES): <b>5:</b>gjla<hi>#''*&$!j-l:4
htmlentities appears to not be encoding those other characters like : Sorry for the wall of text. Is this just me being paranoid?
EDIT #2: ___________