I'm using bleach, which uses html5lib, to clean user-generated content: HTML fragments written as dust.js templates.

Everything has worked fine, except for this situation:

input:

<table>
    {#loop}
      <tr>
         <td>{name}</td>
      </tr>
    {/loop}
</table>

output:

    {#loop}
    {/loop}
<table>
      <tr>
         <td>{name}</td>
      </tr>
</table>

The looping tags are being moved outside of the table. This makes perfect sense: html5lib is correcting my HTML, since content should not appear within the table structure unless it's wrapped in a td/th tag. I usually want corrections like this and still want them to happen, but I'm wondering if there is a way to somehow get these tags through.

Has anyone encountered a similar situation in the past and been able to suppress this sanitization behavior?

The only approach I've come up with so far is to wrap the controls in a tag that I can regex out:

<table>
    <tr data-layout=""><td>{#loop}</td></tr>
      <tr>
         <td>{name}</td>
      </tr>
    <tr data-layout=""><td>{/loop}</td></tr>
</table>

The problem with this approach is that once I regex out this formatting hack, I can't easily build it back in. The encoded template becomes uneditable.
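
A reversible variant of this kind of hack might look like the sketch below. It is not part of the setup described above: the `protect` and `restore` helpers, the `dust:` marker, and the simplified tag pattern are all hypothetical. The idea is to hide each dust control tag in an HTML comment before cleaning and turn it back afterwards, on the assumption that the parser leaves comments inside a table in place and that the sanitizer is configured to keep comments (in bleach this would presumably mean `strip_comments=False`).

```python
import re

# Simplified, assumed pattern for dust control tags such as {#loop}, {/loop}, {:else}.
# Plain references like {name} are deliberately left alone.
DUST_TAG = re.compile(r"\{[#/:?^][^{}<>]*\}")

# Matches the hypothetical comment wrapper produced by protect().
PROTECTED = re.compile(r"<!--dust:(\{[#/:?^][^{}<>]*\})-->")


def protect(template):
    """Wrap each dust control tag in an HTML comment before sanitizing."""
    return DUST_TAG.sub(lambda m: "<!--dust:%s-->" % m.group(0), template)


def restore(html):
    """Turn the comment wrappers back into the original dust control tags."""
    return PROTECTED.sub(r"\1", html)


template = "<table>{#loop}<tr><td>{name}</td></tr>{/loop}</table>"
protected = protect(template)
# The sanitizer call would go here, operating on `protected`.
round_tripped = restore(protected)
```

Because the mapping is one-to-one, the stored template can be edited and re-encoded at any time. The sketch does not handle control tags inside attribute values, where a comment would not be valid.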

Jonathan Vanasco

2 Answers

This isn't to do with sanitization at all, this is about parsing (per spec!). Foster-parenting is how the HTML parser handles most content directly within a table element; to change this you'd have to change the parser in html5parser.py. html5lib aims to implement a conforming HTML parser — it doesn't have APIs to make it non-conforming.
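
The behavior can be reproduced with html5lib alone, without bleach. The following is a minimal sketch, assuming html5lib is installed. The `reserialize` helper is a hypothetical name, and the exact serialized markup may differ between versions (for example in which optional tags are emitted).

```python
from html5lib import parseFragment
from html5lib.serializer import serialize


def reserialize(fragment):
    """Parse an HTML fragment with html5lib and serialize the resulting tree."""
    tree = parseFragment(fragment, treebuilder="etree")
    return serialize(tree, tree="etree")


source = "<table>{#loop}<tr><td>{name}</td></tr>{/loop}</table>"
result = reserialize(source)

# The text placed directly inside <table> is foster-parented, so both
# {#loop} and {/loop} end up before the table in the output.
print(result)
```

Since the reordering already happens while the tree is being built, nothing applied after parsing can tell where the text originally sat.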

gsnedders

Bleach has a built-in whitelist of tags and attributes. However, you can extend or override the existing whitelist. The following is an example of how you could add custom tags to it:

bleach.ALLOWED_TAGS.extend(['{#*}', '{/*}'])

Bleach will then simply treat the "{#loop}" and "{/loop}" tags as safe and not escape them.

The official Bleach documentation provides details on how to define wildcard whitelist entries for tags and attributes.

mahifernando
  • I'm familiar with the `bleach` docs and API. `bleach` doesn't strip these tags out; they're in the result. The problem is that `html5lib`, which is used by `bleach`, notices that these are invalid HTML fragment placements (because they're not valid data within a `table`) and will "fix" the HTML. The table is rendered "sanitized" by `html5lib`, which just migrates this "text" out of the table. I can't tell if this happens when the parser reads or renders, and am not sure it can be overridden cleanly. `html5lib` is very particular about what tags/strings can appear in a table, and when. – Jonathan Vanasco May 08 '14 at 00:15