OK, so I have been reading about Markdown here on SO and elsewhere, and the steps between user input and the database are usually given as:

  1. convert markdown to html
  2. sanitize html (w/whitelist)
  3. insert into database

but to me it makes more sense to do the following:

  1. sanitize markdown (remove all HTML tags, no exceptions)
  2. convert to html
  3. insert into database

Am I missing something? This seems to me to be very nearly XSS-proof.

psb
  • Note that both procedures are flawed. It's better to store the Markdown in the database and convert it to HTML on output. Among other things, this makes it easier for the user to edit the Markdown later. – Marnen Laibow-Koser Dec 25 '13 at 16:55

5 Answers

Please see this link:

http://michelf.com/weblog/2010/markdown-and-xss/

> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>

Becomes

<blockquote>
 <p>hello <a name="n"
 href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>

∴ you must sanitize after converting to HTML.
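
To make the "sanitize after converting" step concrete, here is a minimal sketch of a whitelist pass over the generated HTML, using only Python's standard library. The names `WhitelistSanitizer`, `sanitize_html`, `is_safe_url` and the three whitelists are assumptions made up for this illustration; they are not part of any Markdown library, and a real site would be better served by a maintained, well-tested sanitizer than by this toy.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelists for illustration; tune them to your own site.
ALLOWED_TAGS = {"p", "em", "strong", "blockquote", "a", "ol", "ul", "li", "code", "pre"}
ALLOWED_ATTRS = {"a": {"href", "name"}}
SAFE_SCHEMES = {"", "http", "https", "mailto"}  # "" means a relative URL


def is_safe_url(url):
    """Return True only if the URL's scheme is on the whitelist."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        return False
    return scheme in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    """Rebuilds HTML, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still emitted
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not is_safe_url(value):
                continue  # e.g. javascript: URLs are dropped here
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped so it cannot turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize_html(html):
    """Run the whitelist pass over HTML produced by the Markdown converter."""
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Fed the converted blockquote shown above, `sanitize_html` keeps the `<a name="n">` anchor and the `<em>` but drops the `javascript:` href, because by that point the parser sees the tag exactly as the browser would.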

Jordan Reiter
There are two issues with what you've proposed:

  1. I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
  2. Considerably more important: when using Markdown as the "native" formatting language and whitelisting the other available tags, you are limiting not just the input side of the world but the output as well. In other words, if your display engine expects Markdown and only lets whitelisted content out, then even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected, because you are sanitizing on display as well.
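
As a rough sketch of point 2, assuming the raw Markdown is what gets stored and that conversion plus sanitization happen on every read: the table name `posts`, the column `body_markdown`, and the functions `create_schema`, `save_post` and `render_post` are all hypothetical, and `convert` and `sanitize` stand for whatever Markdown converter and whitelist sanitizer you actually use.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical posts table that stores the raw Markdown only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, body_markdown TEXT NOT NULL)"
    )


def save_post(conn: sqlite3.Connection, markdown_source: str) -> int:
    """Store the user's Markdown untouched, so it can be edited again later."""
    cur = conn.execute(
        "INSERT INTO posts (body_markdown) VALUES (?)", (markdown_source,)
    )
    conn.commit()
    return cur.lastrowid


def render_post(
    conn: sqlite3.Connection,
    post_id: int,
    convert: Callable[[str], str],
    sanitize: Callable[[str], str],
) -> str:
    """Convert and then sanitize on every display, never trusting the stored row."""
    row = conn.execute(
        "SELECT body_markdown FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    if row is None:
        raise KeyError(post_id)
    # Order matters: sanitize the HTML the converter produced, not its input.
    return sanitize(convert(row[0]))
```

Because `render_post` filters whatever is in the row at display time, a post that was tampered with directly in the database goes through the same whitelist as one that came in through the form.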

There are some good resources on the web about output sanitization:

Gerard Roche
John Rudy
  • As for point #1, I think you misunderstood OP. You would still use Markdown-style numbered lists, no problem, because the HTML tag removal would happen *before* Markdown converted `1. Foo` into `<ol><li>Foo</li></ol>`. – Alan H. Mar 10 '11 at 23:18
  • OP seems to suggest storing the HTML in the DB, which would make it impossible to edit since the HTML tags that were stored are considered invalid when supplied as input. – Stijn de Witt Nov 27 '18 at 23:12
  • ... which reminds me of one of the 'rules' I like to use when developing: "accept your own output as input". Don't you just hate it when you copy-paste the output (of, say, an account number) into the input field and it won't accept it until you remove or add something first? You see this often with phone numbers, where they will output it like e.g. `+31 555 1234 5678`, then tell you the plus symbol is an illegal character, or that it's too long, or that it contains spaces when you try to input it. – Stijn de Witt Nov 27 '18 at 23:15
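
The "accept your own output as input" rule from the last comment can be sketched as a round trip. Both `normalize_phone` and `format_phone` are hypothetical helpers, and the 2-3-4-4 digit grouping is only an assumption chosen to reproduce the `+31 555 1234 5678` example, not a real numbering-plan rule.

```python
import re


def normalize_phone(text: str) -> str:
    """Accept the displayed form as input: drop spaces and punctuation, keep a leading +."""
    text = text.strip()
    prefix = "+" if text.startswith("+") else ""
    digits = re.sub(r"[\s().-]", "", text[len(prefix):])
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError(f"not a phone number: {text!r}")
    return prefix + digits


def format_phone(normalized: str) -> str:
    """Display form: two digits, three digits, then groups of four (illustrative only)."""
    prefix = "+" if normalized.startswith("+") else ""
    digits = normalized[len(prefix):]
    groups = [digits[:2], digits[2:5]]
    groups += [digits[i:i + 4] for i in range(5, len(digits), 4)]
    return prefix + " ".join(g for g in groups if g)
```

The property to check is that `normalize_phone(format_phone(n)) == n` for any stored number `n`, so whatever the site prints can be pasted straight back into its own form.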