
Are there any problems with what I am doing here? This is my first time dealing with something like this, and I just want to make sure I understand the risks, etc., of the different approaches.

I am using WMD to get user input, and I am displaying it with a literal control. Since it is uneditable once entered, I will be storing the HTML and not the Markdown:

string input = Server.HtmlEncode(stringThatComesFromWMDTextArea);

I then run something like the following for the tags I want users to be able to use:

// Unescape whitelisted tags.
string output = input.Replace("&lt;b&gt;", "<b>").Replace("&lt;/b&gt;", "</b>")
                     .Replace("&lt;i&gt;", "<i>").Replace("&lt;/i&gt;", "</i>");

Edit: Here is what I am doing currently:

 public static string EncodeAndWhitelist(string html)
 {
     string[] whiteList = { "b", "i", "strong", "img", "ul", "li" };
     string encodedHTML = HttpUtility.HtmlEncode(html);
     foreach (string wl in whiteList)
         encodedHTML = encodedHTML
             .Replace("&lt;" + wl + "&gt;", "<" + wl + ">")
             .Replace("&lt;/" + wl + "&gt;", "</" + wl + ">");
     return encodedHTML;
 }
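
As a rough sanity check (the sample string below is just made up for illustration), this is what I would expect the method to produce:

    string raw = "<b>hello</b> <script>alert(1)</script>";
    string safe = EncodeAndWhitelist(raw);
    // safe == "<b>hello</b> &lt;script&gt;alert(1)&lt;/script&gt;"
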
  1. Will what I am doing here keep me protected from XSS?
  2. Are there any other considerations that should be made?
  3. Is there a good list of normal tags to whitelist?
Jason
  • That code won't work for the "IMG" tag, since replacing "<img>" doesn't allow for the "src" attribute. – David Jan 20 '10 at 20:23

1 Answer


If your requirements really are so basic that such simple string replacements are enough, then yes, this is ‘safe’ against XSS. (However, it's still possible to submit non-well-formed content where <i> and <b> are mis-nested or unclosed, which could potentially mess up the page the content ends up inserted into.)

But this is rarely enough. For example, currently <a href="..."> or <img src="..." /> are not allowed. If you wanted to allow these or other markup with attribute values, you'd have a whole lot more work to do. You might then approach it with regex, but that gives you endless problems with accidental nesting and replacement of already-replaced content, since regex can't parse HTML.

To solve both problems, the usual approach is to use an [X][HT]ML parser on the input, then walk the DOM removing all but known-good elements and attributes, then finally re-serialise to [X]HTML. The result is then guaranteed well-formed and contains only safe content.
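
As a rough sketch of that approach (using the HTML Agility Pack; the class name and the element/attribute whitelist below are illustrative, not a recommendation), the DOM walk might look something like this:

    // Rough sketch only -- assumes the HTML Agility Pack (HtmlAgilityPack) is available.
    // The tag/attribute whitelist here is just an example.
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    public static class HtmlWhitelister
    {
        static readonly HashSet<string> AllowedTags =
            new HashSet<string> { "b", "i", "strong", "em", "ul", "ol", "li", "p", "a", "img" };

        static readonly HashSet<string> AllowedAttributes =
            new HashSet<string> { "href", "src", "alt", "title" };

        public static string Sanitize(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            Clean(doc.DocumentNode);
            return doc.DocumentNode.InnerHtml;   // re-serialise the cleaned DOM
        }

        static void Clean(HtmlNode node)
        {
            // Copy the child list first, because we remove nodes while walking it.
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                if (child.NodeType == HtmlNodeType.Comment)
                {
                    child.Remove();
                }
                else if (child.NodeType == HtmlNodeType.Element)
                {
                    if (!AllowedTags.Contains(child.Name.ToLowerInvariant()))
                    {
                        child.Remove();          // drop the element and everything inside it
                        continue;
                    }

                    foreach (HtmlAttribute attr in child.Attributes.ToList())
                    {
                        if (!AllowedAttributes.Contains(attr.Name.ToLowerInvariant()))
                            child.Attributes.Remove(attr);
                        // href/src values still need a URL scheme check -- see the comments below.
                    }

                    Clean(child);                // recurse into allowed elements
                }
                // Text nodes are kept as-is.
            }
        }
    }
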

bobince
  • So, assuming I wanted something more robust, what would you suggest for the parsers you mentioned? Could HTML Agility Pack handle it? Is there not something that does all this already? – Jason Jan 20 '10 at 20:36
  • Yes, HTML Agility Pack is a good choice. Once you've got the DOM parsed it's a relatively trivial exercise to write a recursive function that removes all but known-good elements/attributes from the DOM tree. Also if you allow `href`/`src`/etc., remember to check the URLs for known-good schemes like `http`/`https`, to avoid injection through `javascript:` URLs and the like. – bobince Jan 20 '10 at 20:58
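
For that last point, a minimal sketch of the scheme check (the class name and the allowed-scheme list are just illustrative) could be:

    using System;
    using System.Linq;

    static class UrlWhitelist
    {
        static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

        // Returns true only for relative URLs or absolute URLs with a known-good scheme,
        // so values like "javascript:alert(1)" are rejected.
        public static bool IsSafe(string url)
        {
            Uri parsed;
            if (!Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out parsed))
                return false;
            if (!parsed.IsAbsoluteUri)
                return true;   // relative URL: no scheme to check
            return AllowedSchemes.Contains(parsed.Scheme.ToLowerInvariant());
        }
    }

You would call something like this from the attribute-cleaning step for `href`/`src` values and drop the attribute (or the whole element) when it returns false.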