5

I am using owasp-java-html-sanitizer and want to add an id attribute to each h2 tag in my HTML. The ids should stay the same across page loads but be unique for each element on the page (as required for id attributes). I first tried counting all elements to get an index and appending that index to every h2 element, but I have no access to that data at this point in Java. I then used UUID.randomUUID(), but because it is random, the id is not persistent.

Here is the code I have currently:

public PolicyFactory HtmlPolicy() {
    return new HtmlPolicyBuilder()
        .allowElements("h3", "h4", "h5", "h6", "p", "span", "br", "b", "strong", "i", "em", "u", "hr", "ol", "ul", "li",
                       "img", "table", "tr", "th", "td", "thead", "tbody", "tfoot", "caption", "colgroup", "col", "blockquote", "figure", "figcaption", "object", "iframe")
        .allowElements(
            (String elementName, List<String> attrs) -> {
                String uniqueID = UUID.randomUUID().toString();
                // Add an attribute.
                attrs.add("id");
                attrs.add("headline-" + uniqueID);
                attrs.add("class");
                attrs.add("scrollspy");
                // Return elementName to include, null to drop.
                return elementName;
            }, "h2")
        .toFactory();
}

In JavaScript I would do it as follows:

        $('h2').each(function(index, obj) {
            let newObj = $(obj)[0];
            $(newObj).attr('id', `headline-2-${index + 1}`);
        });

Does anyone have an idea how to increment a counter by one for every h2 element in this scenario?

Nixen85
  • Reconsider the necessity, as you can use `h2:nth-of-type(index)` to address an H2 on the client side. – Christoph Dahlen Nov 06 '22 at 12:20
  • My initial approach was to add the id on the client side via JavaScript. However, Google search results deep-link directly to h2 headlines. If I add the id with JavaScript, Google's deep links will not work. – Nixen85 Nov 07 '22 at 14:14
  • Can't you read a value of an `attrs` from the h2 and use that to build a unique id? – morganney Nov 12 '22 at 04:19
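
One way to read the last comment is to derive the id from the headline text itself, so it stays stable across page loads without any counter. The sketch below is only an illustration under assumptions: the class `HeadlineSlugs` and its methods are hypothetical names, not part of owasp-java-html-sanitizer, and the sanitizer callback in the question only receives the element name and attribute list, so the headline text would have to come from a separate parsing step.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: builds ids from headline text, one instance per page.
public class HeadlineSlugs {
    // How often each base id has been handed out on the current page.
    private final Map<String, Integer> seen = new HashMap<>();

    // Turns arbitrary headline text into a lowercase, dash-separated slug.
    public static String slugify(String text) {
        // Decompose accented characters and drop the combining marks.
        String ascii = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        String slug = ascii.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // runs of other characters become one dash
                .replaceAll("^-+|-+$", "");      // trim leading and trailing dashes
        return slug.isEmpty() ? "section" : slug;
    }

    // Returns an id for this headline; repeated headlines get a numeric suffix.
    public String uniqueId(String headlineText) {
        String base = "headline-" + slugify(headlineText);
        int count = seen.merge(base, 1, Integer::sum);
        return count == 1 ? base : base + "-" + count;
    }

    public static void main(String[] args) {
        HeadlineSlugs ids = new HeadlineSlugs();
        System.out.println(ids.uniqueId("Getting Started")); // headline-getting-started
        System.out.println(ids.uniqueId("Getting Started")); // headline-getting-started-2
    }
}
```

Because the id depends only on the text and on the order of identical headlines, the same page yields the same ids on every load. The sketch does not guard against a headline whose own slug already ends in such a numeric suffix.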

2 Answers

0

I don't know owasp-java-html-sanitizer, but I think you can use an additional variable, such as an AtomicInteger, to store the index of the last id used. For your code it might look like this:

public PolicyFactory HtmlPolicy() {
    AtomicInteger index = new AtomicInteger();
    return new HtmlPolicyBuilder()
        .allowElements("h3", "h4", "h5", "h6", "p", "span", "br", "b", "strong", "i", "em", "u", "hr", "ol", "ul", "li",
                       "img", "table", "tr", "th", "td", "thead", "tbody", "tfoot", "caption", "colgroup", "col", "blockquote", "figure", "figcaption", "object", "iframe")
        .allowElements(
            (String elementName, List<String> attrs) -> {
                // Add an attribute.
                attrs.add("id");
                attrs.add("headline-" + index.incrementAndGet());
                attrs.add("class");
                attrs.add("scrollspy");
                // Return elementName to include, null to drop.
                return elementName;
            }, "h2")
        .toFactory();
}
tchudyk
  • The idea is basically right, but it increments the integer across page sessions: for a page with 5 headlines, the first h2 is marked headline-1 on the initial page load, headline-6 on the second page load, and headline-11 on the third... The integers must be the same across page loads... – Nixen85 Nov 30 '22 at 12:37
  • So somehow I need to reset the counter after the content of one page is sanitized. – Nixen85 Nov 30 '22 at 18:03
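
The reset asked for in the comments could plausibly be achieved by scoping the counter to one document, for example by building a fresh policy (and with it a fresh AtomicInteger) for every page that is sanitized instead of reusing one shared factory. The sketch below shows only that scoping idea with the standard library: the `BiFunction` is a stand-in for the sanitizer's element callback, and `newH2Policy` and `idsForPage` are hypothetical names, not library API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiFunction;

public class PerDocumentCounter {
    // Stand-in for the element callback: (elementName, attrs) -> elementName.
    // Each call creates its own counter, so numbering restarts at 1.
    public static BiFunction<String, List<String>, String> newH2Policy() {
        AtomicInteger index = new AtomicInteger();
        return (elementName, attrs) -> {
            attrs.add("id");
            attrs.add("headline-" + index.incrementAndGet());
            return elementName;
        };
    }

    // Simulates sanitizing one page that contains h2Count headlines
    // and collects the ids the callback assigned.
    public static List<String> idsForPage(int h2Count) {
        BiFunction<String, List<String>, String> policy = newH2Policy();
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < h2Count; i++) {
            List<String> attrs = new ArrayList<>();
            policy.apply("h2", attrs);
            ids.add(attrs.get(1)); // attrs is [name, value]; index 1 holds the id value
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(idsForPage(3)); // [headline-1, headline-2, headline-3]
        System.out.println(idsForPage(3)); // [headline-1, headline-2, headline-3]
    }
}
```

In the real code this would mean calling the policy-building method once per page rather than caching its result. Whether the sanitizer invokes the callback in document order every time is an assumption here, not something the thread confirms.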
0

The solution provided by @tchudyk could work, but I am not completely sure: HtmlPolicyBuilder is not thread safe, and I do not know whether the library guarantees the same order of iteration over the h2 headers every time.

As an alternative approach, consider the use of JSoup, a Java HTML parsing and manipulation library.

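The idea is to manipulate your HTML with JSoup to add the necessary id attributes before sanitizing it with OWASP. Note that the OWASP policy must then allow the id attribute on h2 (for example with allowAttributes("id").onElements("h2")), since attributes the policy does not allow are dropped.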

For example, you could extract the h2 tags and then set the id attributes as appropriate:

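import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "...";
Document doc = Jsoup.parse(html);
Elements h2Tags = doc.select("h2"); // analogous to your example in jQuery
// Iterate over the results in a deterministic way.
// Start at 1 so the ids match the jQuery version (index + 1).
int index = 1;
for (Element h2Tag : h2Tags) {
  h2Tag.attr("id", "headline-2-" + index++);
}
String transformedHtml = doc.outerHtml();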

Although I think it is less powerful than OWASP, JSoup provides some sanitization functionality as well.

jccampanero