I have a string where I have to replace some content:

"...content... <a href='document/link/B1'>foo</a> ...content... <a href='document/link/B2'>bar</a> ..."

I'm looking for a clean way to obtain something like this:

"...content... <a href='document/link/23'>foo</a> ...content... <a href='document/link/24'>bar</a> ..."

Where '23' and '24' in the links are the results of some processing that I do. So first I need to select the links and get their URLs (more specifically, I need the B1 and B2). Then I have to perform some action on e.g. B1, which results in '23', which I then have to insert back into the string.

Is there a nice way to achieve this?

user485659

1 Answer

In general, it's a bad idea to use regex to parse HTML/XML. But for occasional use (say, a one-off run), if you are sure about the structure of your HTML and don't need much robustness, something like this (based on this) could do the trick:

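   // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
   String original = "...content... <a href='document/link/B1'>foo</a> ...content... <a href='document/link/B2'>bar</a> ...";
   StringBuffer sb = new StringBuffer();
   // tweak the following: group 1 = text before the id, group 2 = the id, group 3 = text after it
   Pattern pattern = Pattern.compile("(<a href='document/link/)([^']*)('>)");
   Matcher matcher = pattern.matcher(original);
   while (matcher.find()) {
      String oldLinkPart = matcher.group(2);
      String newLinkPart = buildNewLinkPart(oldLinkPart); // here you do your look-up
      // quoteReplacement keeps any '$' or '\' in the new text from being read as a group reference
      String replacement = matcher.group(1) + newLinkPart + matcher.group(3);
      matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
   }
   matcher.appendTail(sb); // copy whatever follows the last match
   String modified = sb.toString();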
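For a version you can compile and run as-is, here is a minimal sketch of the same loop with a slightly looser pattern (optional whitespace around `=`, either quote style, case-insensitive tag). The class name `LinkRewriter`, the `rewrite` method and the look-up map in `main` are made up for illustration; plug in your own processing where the lambda is. It still assumes `href` is the first attribute of the tag, and `Map.of` needs Java 9 or later.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkRewriter {
    // Group 1: text up to the id (group 2, nested inside it, is the opening quote),
    // group 3: the id itself, group 4: the closing quote (must equal the opening one).
    private static final Pattern LINK = Pattern.compile(
            "(<a\\s+href\\s*=\\s*(['\"])document/link/)([^'\"]*)(\\2)",
            Pattern.CASE_INSENSITIVE);

    // Replaces the id of every matching link with lookup.apply(id).
    static String rewrite(String html, Function<String, String> lookup) {
        Matcher m = LINK.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + lookup.apply(m.group(3)) + m.group(4);
            // Quote the replacement so '$' and '\' are inserted literally.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical look-up table standing in for the real processing.
        Map<String, String> ids = Map.of("B1", "23", "B2", "24");
        String original = "...content... <a href='document/link/B1'>foo</a> "
                + "...content... <a href='document/link/B2'>bar</a> ...";
        // Ids missing from the table are left unchanged.
        System.out.println(rewrite(original, id -> ids.getOrDefault(id, id)));
    }
}
```

Passing the look-up as a `Function` keeps the regex plumbing separate from whatever turns `B1` into `23`.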

You can tweak the regex pattern to be slightly more general (extra spaces, tabs, additional attributes inside the A tag, case insensitivity, double quotes), but once you try to be totally general, so that your code works with any well-formed HTML, you're screwed: use an XML/DOM parser instead.
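To give an idea of what the parser route looks like, here is a rough sketch using only the JDK's built-in XML DOM API. It assumes the fragment is well-formed XML (real-world HTML often is not, in which case a dedicated HTML parser library is the safer choice). The class name `DomLinkRewriter`, the `rewrite` method and the synthetic `<root>` wrapper are my own additions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomLinkRewriter {
    private static final String PREFIX = "document/link/";

    // Parses the fragment, rewrites matching href attributes, and serializes it again.
    static String rewrite(String fragment, Function<String, String> lookup) {
        try {
            // Wrap in a synthetic root element so the fragment parses as one document.
            String xml = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                String href = a.getAttribute("href"); // empty string if absent
                if (href.startsWith(PREFIX)) {
                    String id = href.substring(PREFIX.length());
                    a.setAttribute("href", PREFIX + lookup.apply(id));
                }
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString(); // still includes the <root> wrapper
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }
}
```

Note that the serializer is free to normalize the markup (attribute quoting, for instance), so the output is equivalent to the input but not necessarily character-for-character identical outside the rewritten ids. Tag matching here is case-sensitive, as in XML.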

leonbloy
  • Thanks for your answer, I had to make a small adjustment: had to change the 2nd group to `([A-Z][0-9]*)` to make it work. – user485659 Apr 13 '12 at 18:50