I need to write a utility that wraps any <a> tag in a <span> tag. For example:

Test string points to <p><a href="http://www.acdevents.com/au2005/">Acd Event</a> with an image <a href="http://www.acdevents.com"><img src="image.jpg"></a>

This needs to be changed to

Test string points to <p><span class="test_class"><a href="http://www.acdevents.com/au2005/">Acd Event</a></span> with an image <a href="http://www.acdevents.com"><img src="image.jpg"></a>

As you can see, the tag needs to be added only when the link points to a physical page, not when it wraps an image.

I was planning to use a regex to achieve this, but without any luck so far.

Any pointers on this will be highly appreciated.

-Thanks

Shamik
  • Trying to do this with a regex sounds painful. Maybe you could use [XSLT](http://www.w3schools.com/xsl/)? Are you working with well-formed HTML documents, or tags embedded in plain text (like in the example)? – Brandon Bohrer Mar 17 '11 at 21:47
  • regex + html = pain. Use DOM instead: http://stackoverflow.com/questions/3524431/wrap-dom-element-in-another-dom-element-in-php – Marc B Mar 17 '11 at 21:47
  • I'm feeling the pain of using regex, but there's no other way out. For some weird reason, I'm receiving the HTML body text as a String from a different service. I need to do some formatting and pre-processing, part of which is the question I posted. There's no scope for XSLT. – Shamik Mar 17 '11 at 21:55
  • I agree with Brandon: regular expressions aren't the right tool for the job. I'd advise the use of a parser such as John Cowan's 'TagSoup' to write some code to filter the HTML. If you prefer something more DOM-like than SAX-like, there's NekoHTML. – Keith Gaughan Mar 17 '11 at 21:55

2 Answers

Turning my comment into an answer: regular expressions aren't the right tool for the job. I'd advise using a parser such as John Cowan's 'TagSoup' and writing some code to filter the HTML. If you prefer something more DOM-like than SAX-like, there's NekoHTML.

If you're absolutely certain you want to go down the regular expression route, and your regex engine supports look-ahead (PCRE and Java's java.util.regex both do), you can use a negative look-ahead assertion. Something like this may do the job for you:

s.replaceAll("<a[^>]*?>(?!<img.*)(.+?)</a>", "<span class=\"test_class\">$0</span>");

I haven't tested that, but the gist is correct. The important part is (?!<img.*), a negative look-ahead: the match is rejected when <img immediately follows the opening <a> tag. (The trailing .* inside the assertion is harmless but redundant.) That may do the job for you, but I'm still of the opinion that a little bit of parsing is the best route.
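
To make the look-ahead idea concrete, here is a minimal, self-contained Java sketch run against the test string from the question. It is a variation on the regex above, not a drop-in copy: the `\b` after `<a` (so tags such as `<abbr>` are ignored), the `\s*` inside the look-ahead, the `.*?` body and the two flags are my own assumptions, and the names `RegexLinkWrapper` and `wrapTextLinks` are hypothetical.

```java
import java.util.regex.Pattern;

class RegexLinkWrapper {
    // An <a ...> tag whose content does not start with <img, up to the nearest </a>.
    // \b keeps the pattern from matching other tags that begin with "a" (e.g. <abbr>).
    private static final Pattern TEXT_LINK = Pattern.compile(
            "<a\\b[^>]*>(?!\\s*<img\\b).*?</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String wrapTextLinks(String html) {
        // $0 is the whole match, i.e. the complete <a>...</a> element.
        return TEXT_LINK.matcher(html)
                .replaceAll("<span class=\"test_class\">$0</span>");
    }

    public static void main(String[] args) {
        String input = "Test string points to <p>"
                + "<a href=\"http://www.acdevents.com/au2005/\">Acd Event</a>"
                + " with an image "
                + "<a href=\"http://www.acdevents.com\"><img src=\"image.jpg\"></a>";
        System.out.println(wrapTextLinks(input));
    }
}
```

This only inspects what comes directly after the opening tag, so a link with text before its image would still be wrapped, and a `>` inside an attribute value would confuse it. Those are the kinds of cases where a parser is the safer choice.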

Keith Gaughan

If you have a library like jQuery on the page you could do it with something like this:

$("a").wrap("<span class='test_class' />");

Or if you need to do some check against the URL first:

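$("a").each(function () {
    var element = $(this);
    var href = element.attr("href");
    // attr() returns undefined for anchors without an href, so guard before calling indexOf.
    if (href && href.indexOf("http://someUrl") > -1) {
        element.wrap("<span class='test_class' />");
    }
});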

If you don't have jQuery you could do it like this:

var elements = document.body.getElementsByTagName("a");
for (var i = 0; i < elements.length; i++) {
    var element = elements[i];
    var clone = element.cloneNode(true);
    var parent = element.parentNode;

    var span = document.createElement("span");
    span.setAttribute("class", "test_class");
    span.appendChild(clone);
    parent.replaceChild(span, element); 
}

You could do something very similar in Java using the Document interface:

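import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// DocumentBuilder is an XML parser, so the markup must be well-formed (XHTML).
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// parse(String) treats its argument as a URI, so wrap the markup in an InputSource.
Document doc = builder.parse(new InputSource(new StringReader(yourJavaHtmlString)));
NodeList nodes = doc.getElementsByTagName("a");
for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    String href = element.getAttribute("href");
    if (!href.equals("http://www.acdevents.com")) {
        Node parent = element.getParentNode();

        Element span = doc.createElement("span");
        span.setAttribute("class", "test_class");
        // Put the span where the link was, then move the link inside it.
        parent.replaceChild(span, element);
        span.appendChild(element);
    }
}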
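
The question's actual criterion is whether a link contains an image, not what its URL is. As a rough, self-contained sketch of that variant using only the JDK's XML classes: it assumes the input is well-formed XML, which the sample string in the question is not (its `<p>` and `<img>` are unclosed), so real input would first need cleaning with something like TagSoup or NekoHTML. The names `DomLinkWrapper` and `wrapTextLinks` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomLinkWrapper {
    static String wrapTextLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Copy the live NodeList first so tree edits cannot disturb the iteration.
            NodeList live = doc.getElementsByTagName("a");
            List<Element> links = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                links.add((Element) live.item(i));
            }

            for (Element link : links) {
                // Skip links that contain an <img> anywhere inside them.
                if (link.getElementsByTagName("img").getLength() > 0) {
                    continue;
                }
                Element span = doc.createElement("span");
                span.setAttribute("class", "test_class");
                // Put the span where the link was, then move the link inside it.
                link.getParentNode().replaceChild(span, link);
                span.appendChild(link);
            }

            // Serialize the modified tree back to a string, without an XML declaration.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            // Parsing fails on markup that is not well-formed XML.
            throw new IllegalArgumentException("Could not process markup", e);
        }
    }
}
```

Calling `DomLinkWrapper.wrapTextLinks(...)` on a well-formed version of the test string (closing `<p>` and writing `<img src="image.jpg"/>`) should wrap only the "Acd Event" link and leave the image link alone.
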
Adam Ayres