0

I need a regex that selects the complete anchor tag but not its value (the link text).

I have tried the regex below on the following use cases, but with no luck:

(<a\s\b(href|title)\b.*\">)?|(<[\/]a>)

1.<a href=\"http://www.ags.ny.gov/\">www.ags.ny.gov</a>

2.<a title=\"ba.com/redeem\" href=\"http://ba.com/rertem\" target=\"_blank\" rel=\"nkiops noreferrer\">ba.com/rertem</a>.

3.<a href=\"http://www.dfs.ny.gov/\">www.ags.ay.gov</a>, for free information

I expect it to select only anchor tags that start with title or href. However, it also selects the closing anchor tag even when the first condition is not satisfied. Reference link: https://regex101.com/r/VcAS6l/1

Nick
  • 21
  • 6

2 Answers

2

I am going to assume that you actually want to find anchor tags in a larger document, and that you will want the process to be accurate and relatively efficient.

Matching against a string that contains (just) a particular kind of opening anchor tag or a closing anchor tag is not useful, especially since in the first case you neither check that the tag is well-formed (see the comment below about '=' and '"') nor extract the anchor's URL in the regex.

Let's analyze your regex:

  (<a\s\b(href|title)\b.*\">)?|(<[\/]a>) 

That is an optional group matching an <a ...> tag OR a non-optional group matching a </a> tag. It will happily match no instances of the optional group; i.e. nothing at all. The ? is probably misplaced.
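To see the effect of the misplaced ?, here is a minimal sketch using java.util.regex. The class name, the helper allMatches, and the sample input are made up for illustration; the first regex is the one from the question, and the second is the same regex with only the ? removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {

    // Collects every match of the regex in the input, including empty matches.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first attribute is neither href nor title.
        String input = "<a id=\"x\">text</a>";

        // The regex from the question, unchanged.
        String original = "(<a\\s\\b(href|title)\\b.*\\\">)?|(<[\\/]a>)";
        // The same regex with the '?' after the first group removed.
        String withoutOptional = "(<a\\s\\b(href|title)\\b.*\\\">)|(<[\\/]a>)";

        // Brackets make empty matches visible.
        for (String match : allMatches(original, input)) {
            System.out.println("original:         [" + match + "]");
        }
        for (String match : allMatches(withoutOptional, input)) {
            System.out.println("without optional: [" + match + "]");
        }
    }
}
```

With java.util.regex, the optional first alternative succeeds at every position by matching nothing, so on this input the original pattern only ever reports empty matches. Once the ? is removed, the alternation matches the closing tag on its own, whether or not an opening tag matched before it.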

Now looking at this

  <a\s\b(href|title)\b.*\">

That says:

  1. '<'
  2. 'a'
  3. A whitespace character
  4. A word boundary
  5. A group consisting of "href" or "title"
  6. A word boundary
  7. Zero or more characters
  8. '"'
  9. '>'

A minor problem with that is that 4. is redundant.

A larger problem is that you don't explicitly match the '=' and '"' that should follow the href or title attribute name.

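The largest problem is in 7. The '*' in '.*' is a greedy quantifier. It tries to match as much as possible. So in practice it will match all the way to the last '"' followed by '>' on the line (or in the whole document if DOTALL mode is enabled, since '.' does not match line terminators by default). That's wrong.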

To fix the largest problem you need to use a reluctant quantifier, one that matches as few characters as it can get away with. For example:

    .*?"

will (initially) stop matching at the first '"' that it sees.
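A small sketch of the difference, assuming a made-up one-line input with two anchors (the class and method names are also just for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {

    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: two anchors on the same line.
        String input = "<a href=\"one\">1</a> and <a href=\"two\">2</a>";

        // Greedy: .* grabs as much as it can, then backs off to the LAST "> on the line.
        System.out.println(firstMatch("<a\\s(href|title)\\b.*\">", input));

        // Reluctant: .*? grows one character at a time and stops at the FIRST ">.
        System.out.println(firstMatch("<a\\s(href|title)\\b.*?\">", input));
    }
}
```

The greedy version swallows everything from the first opening tag through the end of the second opening tag, while the reluctant version stops at the end of the first opening tag.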


Lessons:

  1. It is a bad idea to use regexes to parse structured documents. HTML is particularly difficult, because:

    • there is so much legal variability in the syntax of an HTML document
    • many HTML documents you will find in the wild are malformed.

      Instead, use a proper parser. For example, the Jsoup parser is a good option for parsing HTML documents that may be syntactically invalid. Instead of rejecting a document out of hand, it will try to (internally) correct the errors.

  2. If you are going to "borrow" someone else's regexes, you are relying on their ability to write correct regexes, and your ability to understand if their regex is (really) applicable to your problem. (Did they do it correctly? Are the assumptions that they may have made valid in your use-case?)

  3. If you are going to attempt to write your own regexes to parse complicated documents, you need to understand the (Java) regex language. There are some nasty traps; e.g. greedy quantification, and catastrophic backtracking.

  4. If you have to debug regexes, you need to treat this like any other code debugging problem:

    • Make sure you understand the language (of regexes)
    • Read your code (regexes) carefully.
    • Explain your code (regexes) to your Rubber Duck. (Not a joke.)
    • and so on.

If that sounds too hard, don't use regexes for complicated problems.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
1

This expression might be an option to look into:

<a\s+(?:href|title)=[^>]*>([^<]*)<\/a>

Demo

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class re{

    public static void main(String[] args){

        final String regex = "<a\\s+(?:href|title)=[^>]*>([^<]*)<\\/a>";
        final String string = "<a href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n"
             + "<a title= \"some title\" href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n"
             + "<a nottitle= \"some title\" href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n\n"
             + "<a id=\\\"OLE_LINK2\\\" class=\\\"bookmark\\\" title=\\\"OLE_LINK2\\\" name=\\\"OLE_LINK2\\\"></a>\n\n";
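        // "$1" keeps only the link text captured by group 1, dropping the surrounding tags.
        final String subst = "$1";

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);

        // Replace each matching anchor element with its link text.
        final String result = matcher.replaceAll(subst);

        System.out.println(result);
    }
}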

Output

www.dfs.ny.gov, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";

www.dfs.ny.gov, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";

<a nottitle= "some title" href=\\\"http://www.dfs.ny.gov/\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";


<a id=\"OLE_LINK2\" class=\"bookmark\" title=\"OLE_LINK2\" name=\"OLE_LINK2\"></a>
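If the goal is instead to keep the tags and drop the link text, the same expression can be adapted with extra capturing groups. The following is only a sketch under that assumption: the class name, the tagsOnly helper, and the sample input are hypothetical, and the pattern inherits the limitations of the expression above (for example, it expects href or title to be the first attribute and no nested markup inside the anchor).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTagParts {

    // Adaptation of the expression above:
    // group 1 = opening tag, group 2 = link text, group 3 = closing tag.
    static final Pattern ANCHOR =
            Pattern.compile("(<a\\s+(?:href|title)=[^>]*>)([^<]*)(</a>)");

    // Returns the opening and closing tag of every matching anchor, in document order.
    static List<String> tagsOnly(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            tags.add(m.group(1));
            tags.add(m.group(3));
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical input: the second anchor starts with "id", so it is skipped.
        String html = "1.<a href=\"http://www.ags.ny.gov/\">www.ags.ny.gov</a>\n"
                + "<a id=\"x\" title=\"t\">ignored</a>";
        tagsOnly(html).forEach(System.out::println);
    }
}
```

Because the closing tag is only captured as part of a full anchor match, a </a> that belongs to a non-matching anchor is not reported.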



If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.


Emma
  • 27,428
  • 11
  • 44
  • 69