I'm using CKEditor to let users enter rich text, including embedded images. That content is then sent to other users. How can I prevent malicious injection such as XSS? I think I just need to sanitize the HTML on the server side by removing anything that can execute script, but I can't find a well-tested tool for that. Even GWT's SafeHtmlUtils won't work, because it alters the HTML so much that it breaks what the user intended.

Edit:

I've found a sanitizer called Jsoup. It does exactly what I need, but even with the relaxed whitelist it removes img tags that carry embedded images.

Federico Pugnali

2 Answers

I managed to clean my HTML input with Jsoup this way:

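// Requires jsoup on the classpath; newer jsoup releases rename Whitelist to Safelist.
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String cleanHTML = Jsoup.clean(dirtyHTML,
        Whitelist.relaxed()
                .addProtocols("img", "src", "data")   // allow data: URIs in <img src>
                .addAttributes(":all", "style")       // keep inline style on every allowed tag
                .addTags("span"));                    // also allow <span>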

This accepts any img whose src starts with "data:". That's OK for now, but I asked a separate question about how to accept only the content CKEditor generates ("data:;base64").
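One way to narrow "any data: URI" down to base64 images is an extra check on each src value after cleaning. The sketch below uses only the JDK; the class name `DataUriCheck`, the method `isAllowedImageSrc`, and the list of accepted MIME types are assumptions made for illustration, not something jsoup or CKEditor prescribes.

```java
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataUriCheck {
    // Hypothetical allow-list of image types; adjust to what your editor actually emits.
    private static final Pattern IMAGE_DATA_URI =
            Pattern.compile("^data:image/(png|jpeg|gif);base64,([A-Za-z0-9+/]+={0,2})$");

    /** Returns true only for base64 data URIs of an allowed image type. */
    static boolean isAllowedImageSrc(String src) {
        if (src == null) {
            return false;
        }
        Matcher m = IMAGE_DATA_URI.matcher(src.trim());
        if (!m.matches()) {
            return false;
        }
        try {
            // Reject payloads that are not well-formed base64.
            Base64.getDecoder().decode(m.group(2));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowedImageSrc("data:image/png;base64,iVBORw0KGgo=")); // true
        System.out.println(isAllowedImageSrc("data:text/html;base64,PHNjcmlwdD4=")); // false
        System.out.println(isAllowedImageSrc("javascript:alert(1)"));                // false
    }
}
```

A check like this could run over the img elements of the already-cleaned document, dropping any image whose src does not pass.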

To display the sanitized HTML to the receiving user, we use a sandboxed iframe to avoid CSS disasters (such as a fixed-position image covering the whole page).

<iframe sandbox="allow-same-origin">Sanitized HTML here inside body tag</iframe>
Federico Pugnali

It is very hard to separate good HTML from bad HTML automatically. I would not trust any tool, even if it claims to be secure. Such a separation is not limited to checking which tags or attributes are used and blocking some of them, such as the script tag or event handler attributes (like img.onerror). Many techniques exploit the way browsers parse and handle HTML, and new exploit methods keep being discovered.
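A small illustration of why block-list filtering is fragile: the deliberately naive filter below (a hypothetical `naiveStripScripts`, not taken from any library) removes literal script tags in one pass, and a nested input reassembles into a working tag afterwards.

```java
class NaiveFilterDemo {
    /** Deliberately naive filter: removes literal script tags in a single pass. */
    static String naiveStripScripts(String html) {
        return html.replace("<script>", "").replace("</script>", "");
    }

    public static void main(String[] args) {
        // The outer fragments join into a script tag once the inner tags are removed.
        String input = "<scr<script>ipt>alert(1)</scr</script>ipt>";
        System.out.println(naiveStripScripts(input)); // prints <script>alert(1)</script>
    }
}
```

This is only one of many bypass patterns; allow-list sanitizers that parse the HTML first avoid this particular one, which is the approach the other answer takes.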

I believe the safest way is to use a Markdown editor, like the one used here on Stack Overflow.

You can find some references here: JQuery/JS Markdown plugin?

mesutozer
  • Thanks for the info. I've been reading about PageDown, which is used here. But: "It should be noted that Markdown is not safe as far as user-entered input goes. Pretty much anything is valid in Markdown, in particular something like <script>doEvil();</script>. This PageDown repository includes the two plugins that Stack Exchange uses to sanitize the user's input; see the description of Markdown.Sanitizer.js below". I think we have no other option than to trust some sanitizer tool. – Federico Pugnali Mar 16 '14 at 22:15
  • I think it would be easier to use Markdown plus a sanitizer that removes HTML completely. In addition to removing (or trying to remove) HTML from the user input, this sanitizer can HTML-encode the input and then apply the Markdown rules to add some HTML. This way, even if some HTML gets past the removal phase, it is guaranteed to be encoded in the output. – mesutozer Mar 16 '14 at 22:19
  • I can't completely remove HTML in my case. The whole point of the functionality is to let users send HTML-ready articles to other users. I think I will be OK with something like jsoup that cleans out just the scripts, but I would like to keep embedded images. – Federico Pugnali Mar 16 '14 at 22:25
  • I am not insisting, please do not get me wrong; I just want to be clear. When Markdown is used, no HTML tags appear in the user input. This is the way Markdown works: it has conventions, such as rendering a word in bold when it appears between two stars (*). So normally, user-supplied Markdown does not include any HTML. The sanitizer can remove all HTML at this point, then HTML-encode the input string, then convert the Markdown conventions to real HTML tags (like converting * to the matching bold tag). – mesutozer Mar 16 '14 at 22:30
  • According to this: http://michelf.ca/blog/2010/markdown-and-xss/ the problem is still sanitizing HTML – Federico Pugnali Mar 16 '14 at 22:45
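
The encode-then-render order described in the comments above can be sketched as follows. This is a minimal illustration using only the JDK; the class `SafeMarkdown`, its methods, and the single `**bold**` rule are assumptions chosen to keep the example short, not a real Markdown implementation.

```java
import java.util.regex.Pattern;

class SafeMarkdown {
    // Toy rule: text between double stars becomes bold.
    private static final Pattern BOLD = Pattern.compile("\\*\\*(.+?)\\*\\*");

    /** HTML-encodes the characters that can open tags, attributes or entities. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    /** Encodes the whole input first, then turns the Markdown convention into a tag. */
    static String render(String userInput) {
        String encoded = escapeHtml(userInput);
        return BOLD.matcher(encoded).replaceAll("<b>$1</b>");
    }

    public static void main(String[] args) {
        // prints <b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;
        System.out.println(render("**hi** <script>alert(1)</script>"));
    }
}
```

Because encoding happens before any tag is generated, the only tags in the output are the ones the renderer itself emits. A renderer that also emits links or images would still have to validate their URLs.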