My recent submission to the Firefox add-on site (based on Firefox Add-on SDK 1.10) was rejected because I did not sanitize the input I use, and the reviewer suggested that I use nsIParserUtils.

I found the function parseHTML(doc, html, allowStyle, baseURI, isXML) on that page and changed it to:

function parseHTML(doc, html, allowStyle, baseURI, isXML) {
    var parser = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
    var f =  parser.parseFragment(html, allowStyle ? parser.SanitizerAllowStyle : 0,
                                        !!isXML, baseURI, doc);
    return f;
}

And the first parameter in that is said to be a document element. I have no idea what that is supposed to be? I tried document.createDocumentFragment(), but I get a "ReferenceError: document is not defined" error. Can someone help me with how to call this function?

Also, the function returns an nsIDOMDocumentFragment. How do I convert that back to a string?


UPDATE:

As suggested by @zer0 I used:

var parser = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
var sanitizedHTML = parser.sanitize(html, flags);

But it defeats the purpose of what I wanted to do. For example:

<html><head><BASE href='http://localhost/t/h.html' />
<link rel="stylesheet" type="text/css" href="h.css">
<style type="text/css">
.b{
    color:green;
}
</style>
<base href="http://foo.example.com/">
</head><body>Sample Text. No Style
<script>Hello malicious code</script>
<p class="a">External Style</p>
<p class="b">Internal Style</p>
<p style="color:blue">Inline Style</p>

<a href="sample.html">Link</a><br><br><div style='color: #666666; font-size: 12px'>Clipped on 6-October-2012, 07:37:39 PM from <a href='http://localhost/t/h.html'>http://localhost/t/h.html</a> </div></body></html>

Is converted to:

<html><head>  


<style type="text/css">
.b{

    color:green;
}
</style>



</head><body>Sample Text. No Style

<p class="a">External Style</p>
<p class="b">Internal Style</p>
<p style="color:blue">Inline Style</p>

<a>Link</a><br><br><div style="color: #666666; font-size: 12px">Clipped on 6-October-2012, 07:37:39 PM from <a href="http://localhost/t/h.html">http://localhost/t/h.html</a> </div></body></html>

As this strips the external hyperlinks and CSS, it defeats the purpose of the add-on itself. What I want is for just the scripts to be removed:

<html><head><BASE href='http://localhost/t/h.html' /> <BASE href='http://localhost/t/h.html' /> 
<link rel="stylesheet" type="text/css" href="h.css">

<style type="text/css">
.b{

    color:green;
}
</style>
<base href="http://foo.example.com/">


</head><body>Sample Text. No Style
<p class="a">External Style</p>
<p class="b">Internal Style</p>
<p style="color:blue">Inline Style</p>

<a href="sample.html">Link</a><br><br><div style='color: #666666; font-size: 12px'>Clipped on 6-October-2012, 07:37:39 PM from <a href='http://localhost/t/h.html'>http://localhost/t/h.html</a> </div></body></html>

Can someone shed some light on this?

Wladimir Palant

2 Answers

Links to external styles are removed for a reason: external styles cannot be validated and they might be dangerous (in particular, -moz-binding can be used to run code). The sanitizer also assumes that you might put the HTML code into a location where following relative links isn't safe (such as mail messages in Thunderbird). Absolute links are always fine, however.
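
To see why a relative link depends on where the HTML ends up, here is a minimal sketch using the standard URL constructor. It is only an illustration and runs outside the add-on; the two base addresses are taken from the sample document in the question, and the variable names are made up for this example.

```javascript
// The same relative href resolves differently depending on the base URL
// of the document that contains it.
var relativeHref = "sample.html";

var resolvedOnOriginalPage = new URL(relativeHref, "http://localhost/t/h.html").href;
var resolvedElsewhere = new URL(relativeHref, "http://foo.example.com/").href;

// An absolute href ignores the base, so it points to the same place everywhere.
var absoluteHref = new URL("http://localhost/t/h.html", "http://foo.example.com/").href;

console.log(resolvedOnOriginalPage); // http://localhost/t/sample.html
console.log(resolvedElsewhere);      // http://foo.example.com/sample.html
console.log(absoluteHref);           // http://localhost/t/h.html
```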

What you might want to do is preprocess the HTML code to remove these issues: resolve relative links and inline the references to external styles. Something like this:

// Cc and Ci come from the "chrome" module in the Add-on SDK
var { Cc, Ci } = require("chrome");

// Parse the HTML code into a temporary document
var doc = Cc["@mozilla.org/xmlextras/domparser;1"]
               .createInstance(Ci.nsIDOMParser)
               .parseFromString(html, "text/html");

// Make sure all links are absolute
for (var i = 0; i < doc.links.length; i++)
    doc.links[i].setAttribute("href", doc.links[i].href);

// Make sure all stylesheets are inlined
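var stylesheets = doc.getElementsByTagName("link");
for (i = 0; i < stylesheets.length; i++)
{
    // Skip <link> elements that are not stylesheets (icons, feeds, ...)
    if (!/\bstylesheet\b/i.test(stylesheets[i].rel))
        continue;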
    try
    {
        var request = new XMLHttpRequest();
        request.open("GET", stylesheets[i].href, false);
        request.send(null);
        var style = doc.createElement("style");
        style.setAttribute("type", "text/css");
        style.textContent = request.responseText;
        stylesheets[i].parentNode.replaceChild(style, stylesheets[i]);
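        // getElementsByTagName returns a live list that just lost this
        // element, so stay on the same index for the next iteration
        i--;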
    }
    catch (e)
    {
        // Ignore download errors
    }
}

// Serialize the document into a string again
html = Cc["@mozilla.org/xmlextras/xmlserializer;1"]
         .createInstance(Ci.nsIDOMSerializer)
         .serializeToString(doc.documentElement);

// Now sanitize the HTML code
var parser = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
var sanitizedHTML = parser.sanitize(html, parser.SanitizerAllowStyle);

Note that I used a synchronous XMLHttpRequest to download the stylesheet contents. This was done for simplicity; your final code should use asynchronous downloads (most likely via the request module) that will not hang the user interface.
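
One way the asynchronous variant could be structured is sketched below. This is an assumption-laden outline, not the answer's actual code: fetchText is a hypothetical function that you would write yourself as a thin wrapper around whatever asynchronous request API the add-on uses, downloadStylesheets is a made-up name, and the sketch assumes standard Promise support in the environment.

```javascript
// fetchText is a hypothetical function supplied by the caller: it takes a URL
// and returns a Promise for the stylesheet text.
function downloadStylesheets(urls, fetchText) {
    var downloads = urls.map(function (url) {
        return Promise.resolve()
            .then(function () { return fetchText(url); })
            .then(
                function (css) { return { url: url, css: css }; },
                // Ignore download errors, like the catch block in the loop above
                function () { return { url: url, css: null }; }
            );
    });
    // Resolves once every download has either finished or failed
    return Promise.all(downloads);
}
```

Once the returned promise resolves, each entry whose css is not null can replace its link element with a style element, and the serializing and sanitizing steps run afterwards.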

Wladimir Palant
  • Hi, I get an error inside the catch statement: ReferenceError: XMLHttpRequest is not defined. I tried using the request API (https://addons.mozilla.org/en-US/developers/docs/sdk/latest/modules/sdk/request.html), but that is too slow... and most of the time it hangs :( – Jayarathina Madharasan Dec 16 '12 at 09:59
  • @JayarathinaMadharasan: With the Add-on SDK you should use the `request` module. If you have trouble using it then you should create a new question - normally it is neither "slow" nor unstable. – Wladimir Palant Dec 16 '12 at 11:25
  • Thanks a lot for your guidance. I got it to work based on your code, especially the part that makes all links absolute. – Jayarathina Madharasan Dec 29 '12 at 06:44
  • oops sorry, I didn't know that :(.. But did it now... Thanks a lot... :) – Jayarathina Madharasan Jan 03 '13 at 04:26

And the first parameter in that is said to be a document element. I have no idea what that is supposed to be?

You don't need that. Just use the nsIParserUtils.sanitize method, which takes a string as input and returns the sanitized version as output:

var parser = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
var sanitizedHTML = parser.sanitize(html, flags);

See the "Constants" section of the page linked above to find out which flags you need for your scenario.
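
The flags are bit values that can be combined with bitwise OR. The sketch below shows the idea with a stand-in object so that it runs outside Firefox. The constant names are those of nsIParserUtils, but the numeric values written here are assumptions for illustration only, and hasFlag is a made-up helper; in an add-on you would read the constants from the real parser service instead.

```javascript
// Stand-in for the nsIParserUtils service (assumption: illustrative values,
// take the real ones from the service obtained via Cc/Ci in an add-on).
var parser = {
    SanitizerAllowComments: 1 << 0,
    SanitizerAllowStyle: 1 << 1,
    SanitizerDropForms: 1 << 4
};

// Combine several flags into the single number passed as the second argument
var flags = parser.SanitizerAllowStyle | parser.SanitizerAllowComments;

// Hypothetical helper to check whether a given flag is part of the combination
function hasFlag(combined, flag) {
    return (combined & flag) !== 0;
}

console.log(hasFlag(flags, parser.SanitizerAllowStyle)); // true
console.log(hasFlag(flags, parser.SanitizerDropForms));  // false
```

With the real service, the call would then be parser.sanitize(html, flags), as in the snippet above.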

ZER0