
I am looking for a way to let a user format their text. The formatting is limited to:

  • underline
  • italic
  • bold
  • enumeration

I would like to use Markdown and convert the Markdown to HTML on the server side.

My problem is that Markdown supports a lot more formatting than I want to allow (headings, tables, ...).

Do you know of a Markdown library that lets me whitelist only underline, italic, bold, and so on?

If no library supports whitelisting, I thought about cleaning up the resulting HTML with jsoup. Is that the preferred way?

Thank you.

NoobieNoob
  • Generally this is accomplished with an HTML sanitizer. Use a full-featured Markdown parser, then pass the output through the HTML sanitizer, which strips out all non-whitelisted HTML tags. – Waylan Mar 23 '17 at 12:58
  • @Waylan, you could add your comment as an answer. –  Sep 17 '17 at 16:37
  • @Hal9k I've added an answer which suggests a few different approaches. – Waylan Sep 17 '17 at 20:13

1 Answer


There are a few different ways this could be accomplished. Which one you choose depends on which libraries you use (suggesting specific tools is off-topic on Stack Overflow) and exactly what behavior you are looking for. You can find a summary of each approach below.

Modify a Markdown parser.

Some parsers provide an API to allow you to modify their behavior. You could perhaps remove the bits and pieces which parse tables, headers, etc. and leave the rest in place. Your final output would then leave in any Markdown syntax for those features. For example, if the author types a header, they would get a paragraph which begins with hashes.
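
To make the idea concrete, here is a minimal sketch in plain Java. It is a toy parser written for illustration, not the API of any real Markdown library: the class `ToyMarkdown`, its rule registry, and the `disableRule` method are all hypothetical. It only shows how removing the "heading" rule makes a header fall through to the paragraph handler with its hashes intact.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy parser: handles one block per line, plus bold and italic.
class ToyMarkdown {
    private static final Pattern HEADING = Pattern.compile("^(#{1,6})\\s+(.*)$");

    // Named block rules, tried in order. A rule returns null when it does not match.
    private final Map<String, Function<String, String>> blockRules = new LinkedHashMap<>();

    ToyMarkdown() {
        blockRules.put("heading", line -> {
            Matcher m = HEADING.matcher(line);
            if (!m.matches()) {
                return null;
            }
            int level = m.group(1).length();
            return "<h" + level + ">" + inline(m.group(2)) + "</h" + level + ">";
        });
    }

    // Removing a rule means its syntax is no longer recognised at all.
    void disableRule(String name) {
        blockRules.remove(name);
    }

    String render(String markdown) {
        StringBuilder out = new StringBuilder();
        for (String line : markdown.split("\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String html = null;
            for (Function<String, String> rule : blockRules.values()) {
                html = rule.apply(line);
                if (html != null) {
                    break;
                }
            }
            // Fallback: anything no rule claimed becomes a paragraph.
            if (html == null) {
                html = "<p>" + inline(line) + "</p>";
            }
            out.append(html).append("\n");
        }
        return out.toString();
    }

    // Escapes raw HTML, then applies the two inline styles this toy supports.
    private static String inline(String text) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        escaped = escaped.replaceAll("\\*\\*(.+?)\\*\\*", "<strong>$1</strong>");
        return escaped.replaceAll("\\*(.+?)\\*", "<em>$1</em>");
    }
}
```

With the default rules, `# Title` renders as a level-one heading. After calling `disableRule("heading")`, the same input renders as a paragraph whose text still starts with the hash. How (and whether) a real parser lets you switch rules off varies, so check its documentation.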

Create a custom renderer.

Some Markdown parsers work in two steps. In step 1, the parser takes the Markdown text and outputs an Abstract Syntax Tree (AST) and in step 2, the renderer accepts an AST and outputs HTML. You could either modify the default renderer or build a custom renderer which handles each element as you wanted. For example, you can tell the "header" renderer method to output a paragraph (rather than a header) and you can choose whether that paragraph includes the original hashes or not.
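
A rough sketch of the renderer half of that two-step design, in plain Java. The `Node` class and its `Type` enum are hypothetical stand-ins for whatever AST your parser actually produces, so treat the names as assumptions. The point is only that the renderer decides what a heading node turns into.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical AST and renderer; real parsers define their own node classes.
class RestrictedRenderer {
    enum Type { DOCUMENT, HEADING, PARAGRAPH, BOLD, ITALIC, TEXT }

    static final class Node {
        final Type type;
        final String text;          // only used by TEXT nodes
        final List<Node> children;

        Node(Type type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Small factory helpers to keep tree construction readable.
    static Node text(String s) {
        return new Node(Type.TEXT, s);
    }

    static Node of(Type type, Node... children) {
        return new Node(type, null, children);
    }

    static String render(Node node) {
        // Render the children first, then wrap them according to the node type.
        StringBuilder inner = new StringBuilder();
        for (Node child : node.children) {
            inner.append(render(child));
        }
        switch (node.type) {
            case TEXT:
                return escape(node.text);
            case BOLD:
                return "<strong>" + inner + "</strong>";
            case ITALIC:
                return "<em>" + inner + "</em>";
            case HEADING:
                // Disallowed element: downgrade to a plain paragraph, without hashes.
                return "<p>" + inner + "</p>";
            case PARAGRAPH:
                return "<p>" + inner + "</p>";
            default:
                // DOCUMENT: just the concatenated children.
                return inner.toString();
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

In this sketch a heading keeps its text but loses its heading tag. If you would rather show the original hashes, the AST node would need to carry the heading level so the `HEADING` branch could prepend them.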

Use an HTML Sanitizer.

Use your Markdown parser of choice, passing the text in and taking the output without modification. Then pass the HTML output into an HTML sanitizer, which will strip out any tags not in a whitelist. In this scenario there will be no clue that a header used to be a header. In the final output it will simply look like a regular paragraph.
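
To illustrate only the whitelisting idea, here is a deliberately small, regex-based sketch in plain Java. It is not a substitute for a real HTML sanitizer (such as the jsoup library mentioned in the question), which works on a parsed document and copes with malformed markup. The class name `ToySanitizer` and the contents of the allowed set are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy whitelist filter, for illustration only. Use a real sanitizer in production.
class ToySanitizer {
    // Matches an opening or closing tag and captures the tag name.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    // Assumed whitelist: paragraphs, bold, italic, underline, and lists.
    private static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("p", "strong", "b", "em", "i", "u", "ul", "ol", "li"));

    static String sanitize(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text between tags is kept, with stray angle brackets neutralised.
            out.append(escapeAngles(html.substring(last, m.start())));
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (ALLOWED.contains(name)) {
                // Re-emit the tag without any attributes.
                boolean closing = html.charAt(m.start() + 1) == '/';
                out.append(closing ? "</" : "<").append(name).append(">");
            }
            // Tags that are not whitelisted are dropped; their inner text survives.
            last = m.end();
        }
        out.append(escapeAngles(html.substring(last)));
        return out.toString();
    }

    private static String escapeAngles(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Passing `<h1>Title</h1><p>Some <strong>bold</strong> text</p>` through this sketch drops the `h1` tags and keeps the word "Title" as bare text next to the untouched paragraph. Whether the surviving text gets wrapped in a paragraph or left bare depends on the sanitizer you actually use.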

Waylan