Regular expression to select the anchor tag, not the value

Question

I need a regex that selects the complete anchor tag except its value.

I have tried the regex below, but with no luck:

(<a\s\b(href|title)\b.*\">)?|(<[\/]a>)

These are the use cases:

1.<a href=\"http://www.ags.ny.gov/\">www.ags.ny.gov</a>

2.<a title=\"ba.com/redeem\" href=\"http://ba.com/rertem\" target=\"_blank\" rel=\"nkiops noreferrer\">ba.com/rertem</a>.

3.<a href=\"http://www.dfs.ny.gov/\">www.ags.ay.gov</a>, for free information

I expect it to select only anchor tags that start with title or href. However, it also selects the closing anchor tag even when the first condition is not satisfied. Reference link: https://regex101.com/r/VcAS6l/1

Not sure why the anchor tags are not showing properly in the question: 1. www.dfs.ny.gov 2. aa.com/redeem 3. www.dfs.ny.gov — Nick, Sep 15 '19 at 02:37
@Emma thank you for the response. It should not select anything if the anchor tag has not started. — Nick, Sep 15 '19 at 02:40
I tried it at the link below; it selects the closing tag even though the first condition is not true. https://regex101.com/r/VcAS6l/1 — Nick, Sep 15 '19 at 02:43
@Nick is this what you're after [`().*?(<\/a>)`](https://regex101.com/r/VcAS6l/2) — Code Maniac, Sep 15 '19 at 02:51
Don’t use a regular expression for this. It will work sometimes but it *will* eventually fail. See https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg. — VGR, Sep 15 '19 at 03:26
@VGR I'm trying to do this on an HTML response after mapping it to a string. On that string I'm using this regex to remove anchor tags, so I don't think it will cause a problem. Please advise if I'm wrong. — Nick, Sep 15 '19 at 03:41
You’re wrong. Let’s start with the fact that HTML can legally use single quotes: `www.ags.ny.gov` And then there’s the possibility that any number of tags might be in comments: `` Those are just two examples. — VGR, Sep 16 '19 at 01:17

Stephen C · Answer 1 · 2019-09-16T03:09:32.677

I am going to assume that you actually want to find anchor tags in a larger document, and that you will want the process to be accurate and relatively efficient.

Matching against a string that contains (just) a particular kind of opening anchor tag or a closing anchor tag is not useful, especially since in the first case you don't check that it is well-formed (see the comment about '=' and '"') or extract the anchor's URL in the regex.

Let's analyze your regex:

  (<a\s\b(href|title)\b.*\">)?|(<[\/]a>)

That is an optional group matching an <a ...> tag OR a non-optional group matching a </a> tag. It will happily match no instances of the optional group; i.e. nothing at all. The ? is probably misplaced.
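
To see the "nothing at all" behaviour concretely, here is a minimal sketch using java.util.regex. The class and method names (OptionalGroupDemo, nonEmptyMatches, emptyMatchCount) and the sample input are made up for illustration. Because the first alternative is optional, it succeeds with an empty match wherever the opening tag does not match, so on this input find() reports only empty matches and the second alternative never gets a chance to match the closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // The regex from the question, written as a Java string literal.
    static final Pattern ORIGINAL =
            Pattern.compile("(<a\\s\\b(href|title)\\b.*\">)?|(<[\\/]a>)");

    // Collects every match that consumed at least one character.
    static List<String> nonEmptyMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                found.add(m.group());
            }
        }
        return found;
    }

    // Counts the matches that consumed nothing (the optional group matched zero times).
    static int emptyMatchCount(String input) {
        int count = 0;
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            if (m.group().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical input: an anchor that starts with neither href nor title.
        String input = "<a id=\"x\">text</a>";
        System.out.println("non-empty matches: " + nonEmptyMatches(input));
        System.out.println("empty matches: " + emptyMatchCount(input));
    }
}
```

Other engines may continue differently after an empty match, which could explain why the closing tag shows up on regex101; that is an assumption, not something this sketch verifies.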

Now looking at this

  <a\s\b(href|title)\b.*\">

That says:

1. '<'
2. 'a'
3. A whitespace character
4. A word boundary
5. A group consisting of "href" or "title"
6. A word boundary
7. Zero or more characters
8. '"'
9. '>'

A minor problem with that is that 4. is redundant.

A larger problem is that you don't explicitly match the '=' and '"' that should follow the href or title attribute name.
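
As a rough illustration of why the missing '=' matters, the sketch below compares the opening-tag part of the question's regex with a stricter variant. The stricter pattern, the class name AttributeNameDemo, the helper matchesWholeTag, and the title-case attribute are all assumptions made up for this example. Each input is a single tag checked with matches(), so the quantifier issue discussed next does not come into play.

```java
import java.util.regex.Pattern;

class AttributeNameDemo {
    // Opening-tag part of the question's regex: nothing forces '=' after the name.
    static final Pattern LOOSE = Pattern.compile("<a\\s\\b(href|title)\\b.*\">");

    // Hypothetical stricter variant: the attribute name must be followed by = and an opening quote.
    static final Pattern STRICT = Pattern.compile("<a\\s(href|title)\\s*=\\s*\".*\">");

    // True when the pattern matches the entire tag string.
    static boolean matchesWholeTag(Pattern p, String tag) {
        return p.matcher(tag).matches();
    }

    public static void main(String[] args) {
        // "title-case" is a made-up attribute; '-' is a non-word character,
        // so \b still sees a word boundary right after "title".
        String odd = "<a title-case=\"x\">";
        String normal = "<a title=\"x\">";
        System.out.println(matchesWholeTag(LOOSE, odd));
        System.out.println(matchesWholeTag(STRICT, odd));
        System.out.println(matchesWholeTag(STRICT, normal));
    }
}
```

The loose pattern accepts the made-up attribute because a word boundary is all it requires after the name, while the stricter one rejects it and still accepts an ordinary title attribute.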

The largest problem is in 7. The '*' in '.*' is a greedy quantifier. It tries to match as much as possible. So in practice it will match all the way to the last '"' and '>' on the line (by default '.' does not match line terminators), or in the whole document if DOTALL mode is enabled. That's wrong.

To fix the largest problem you need to use a reluctant quantifier: one that matches as few characters as it can get away with. For example:

    .*?"

will (initially) stop matching at the first '"' that it sees.
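
The difference is easy to see on a line that contains two anchors. This is a small sketch, with a made-up class name (QuantifierDemo), helper (firstMatch), and sample line; it only swaps '.*' for '.*?' in the opening-tag part of the regex and is not meant as a complete fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with two anchors on the same line.
        String line = "<a href=\"x\">one</a> and <a href=\"y\">two</a>";

        // Greedy: .* runs to the end of the line, then backs up to the last "> it can find.
        System.out.println(firstMatch("<a\\s(href|title).*\">", line));

        // Reluctant: .*? stops at the first "> it reaches.
        System.out.println(firstMatch("<a\\s(href|title).*?\">", line));
    }
}
```

The greedy version swallows everything from the first opening tag through the second one, while the reluctant version returns just the first opening tag.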

Lessons:

It is a bad idea to use regexes to parse structured documents. HTML is particularly difficult, because:
- there is so much legal variability in the syntax of an HTML document
- many HTML documents you will find in the wild are malformed.

  Instead, use a proper parser. For example, the Jsoup parser is a good option for parsing HTML documents that may be syntactically invalid. Instead of rejecting a document out of hand, it will try to (internally) correct the errors.
If you are going to "borrow" someone else's regexes, you are relying on their ability to write correct regexes, and your ability to understand if their regex is (really) applicable to your problem. (Did they do it correctly? Are the assumptions that they may have made valid in your use-case?)
If you are going to attempt to write your own regexes to parse complicated documents, you need to understand the (Java) regex language. There are some nasty traps; e.g. greedy quantification and catastrophic backtracking.
If you have to debug regexes, you need to treat this like any other code debugging problem:
- Make sure you understand the language (of regexes)
- Read your code (regexes) carefully.
- Explain your code (regexes) to your Rubber Duck. (Not a joke.)
- and so on.

If that sounds too hard, don't use regexes for complicated problems.

Emma · Answer 2 · 2019-09-15T04:20:12.720

This expression might be an option to look into:

<a\s+(?:href|title)=[^>]*>([^<]*)<\/a>

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class re{

    public static void main(String[] args){

        // Matches an anchor whose first attribute is href or title; group 1 captures the link text.
        final String regex = "<a\\s+(?:href|title)=[^>]*>([^<]*)<\\/a>";
        // Sample input; the extra backslashes are literal characters in the test string.
        final String string = "<a href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n"
             + "<a title= \"some title\" href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n"
             + "<a nottitle= \"some title\" href=\\\\\\\"http://www.dfs.ny.gov/\\\\\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>\";\n\n\n"
             + "<a id=\\\"OLE_LINK2\\\" class=\\\"bookmark\\\" title=\\\"OLE_LINK2\\\" name=\\\"OLE_LINK2\\\"></a>\n\n";
        // Replace each matched anchor with its captured link text (group 1).
        final String subst = "$1";

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);

        // Anchors that do not match the pattern are left untouched.
        final String result = matcher.replaceAll(subst);

        System.out.println(result);
    }
}

Output

www.dfs.ny.gov, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";

www.dfs.ny.gov, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";

<a nottitle= "some title" href=\\\"http://www.dfs.ny.gov/\\\">www.dfs.ny.gov</a>, for free information on comparative credit card rates, fees and grace periods.&nbsp;</span>";


<a id=\"OLE_LINK2\" class=\"bookmark\" title=\"OLE_LINK2\" name=\"OLE_LINK2\"></a>

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.

Thank you for the response. As I asked, I need to select the complete anchor tag except its value, so that I can replace the opening and closing tags with an empty string, like string.replaceAll(regexExpression, ""). — Nick, Sep 15 '19 at 04:11
Sorry, my bad: the question heading stated the wrong need. I'm editing it now. — Nick, Sep 15 '19 at 04:20
Your response is not working for the condition below: https://www.pp.com/i09n/aadvantage-program/terms-and-conditions.jsp — Nick, Sep 15 '19 at 23:36
