0

I'm looking for a C/C++ equivalent of the Perl module HTML::Defang, and my Google-fu hasn't turned anything up. I want to keep benign tags and strip out or defang everything else. Failing an actual library, pointers to complete lists of tags, attributes, etc. to defang would be appreciated. I already know of http://en.wikipedia.org/wiki/DOM_Events. Thanks.

lucideer

2 Answers

1

In Java, I use JTidy to clean up HTML. I'm not sure it would suit your needs, but JTidy is a port of HTML Tidy, which is written in C; if you Google for JTidy you can follow the link to the C implementation and see whether it does what you want.

As for what to defang: Look at the W3C specs for HTML; any tag not in there doesn't belong in HTML. But again, I could be misunderstanding your "defang" concept.
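To make the "what to defang" idea concrete, here is a minimal sketch in C of the three checks a defanger typically applies once the HTML has been parsed: an allowlist of tags, a test for event-handler attributes (the `on*` family from the DOM Events list), and a test for script-bearing URL schemes. The function names and the tag list are hypothetical, chosen for illustration; they are not taken from HTML::Defang or Tidy, and the allowlist is far from complete. The sketch also assumes attribute values have already been entity-decoded by whatever parser you use.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative allowlist only (an assumption); a real sanitizer needs a reviewed list. */
static const char *const ALLOWED_TAGS[] = {
    "a", "b", "blockquote", "br", "em", "i", "li", "ol", "p", "pre", "strong", "ul"
};

/* Case-insensitive equality of two NUL-terminated strings. */
static bool equals_ignore_case(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return false;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}

/* Keep a tag only if it appears on the allowlist. */
bool is_allowed_tag(const char *name) {
    for (size_t i = 0; i < sizeof ALLOWED_TAGS / sizeof ALLOWED_TAGS[0]; ++i) {
        if (equals_ignore_case(name, ALLOWED_TAGS[i])) return true;
    }
    return false;
}

/* Treat every attribute that starts with "on" (onclick, onload, ...) as an
   event handler. This over-blocks on purpose rather than enumerating events. */
bool is_event_handler_attr(const char *name) {
    return tolower((unsigned char)name[0]) == 'o' &&
           tolower((unsigned char)name[1]) == 'n' &&
           name[2] != '\0';
}

/* True if a URL attribute value (href, src, ...) uses a scheme that can carry
   script. Whitespace and control characters are skipped while reading the
   scheme, because browsers tolerate them inside "javascript:". */
bool has_script_scheme(const char *value) {
    char scheme[16];
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)value; *p; ++p) {
        if (*p <= 0x20) continue;                 /* skip whitespace/control chars */
        if (*p == ':') {
            scheme[n] = '\0';
            return equals_ignore_case(scheme, "javascript") ||
                   equals_ignore_case(scheme, "vbscript") ||
                   equals_ignore_case(scheme, "data");
        }
        if (n + 1 >= sizeof scheme) return false; /* too long to be a blocked scheme */
        scheme[n++] = (char)*p;
    }
    return false;                                 /* no colon: relative URL */
}
```

While walking the parsed document you would drop any element for which `is_allowed_tag` returns false, and on the elements you keep, remove every attribute for which `is_event_handler_attr` is true or whose URL value makes `has_script_scheme` true. Allowlisting is the safer default here: anything you have not explicitly reviewed gets removed, so new or obscure tags cannot slip through.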

Carl Smotricz
  • Basically what I want is what web-based email systems do when presented with HTML email. Display what they can, nuke the rest, including any attacks. –  Dec 17 '09 at 19:16
  • This is more an art than a science. I think you'd do well to let Tidy strip out any scripts. But I can't evaluate Tidy for you. Try it! – Carl Smotricz Dec 17 '09 at 19:18
1

libxml2 is free and should do what you want.

http://www.xmlsoft.org/

See this part of the API: http://www.xmlsoft.org/html/libxml-HTMLparser.html

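The htmlReadFile() function might do the trick: it parses an HTML file into a document tree, which you can then walk to remove unwanted nodes. htmlReadMemory() does the same for a buffer already in memory.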

To get you started with libxml2 some examples can be found here:

http://www.xmlsoft.org/examples/index.html

jcoffland