I am going to assume that you actually want to find anchor tags in a larger document, and that you will want the process to be accurate and relatively efficient.
Matching against a string that contains (just) a particular kind of opening anchor tag or a closing anchor tag is not useful, especially since in the first case you don't check that it is well-formed (see the comment about '=' and '"' below) or extract the anchor's URL in the regex.
Let's analyze your regex:
(<a\s\b(href|title)\b.*\">)?|(<[\/]a>)
That is an optional group matching an <a ...> tag OR a non-optional group matching a </a> tag. It will happily match no instances of the optional group; i.e. nothing at all. The ? is probably misplaced.
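To see the "matches nothing at all" behaviour for yourself, here is a minimal sketch using java.util.regex. The class name EmptyMatchDemo and the helper firstMatch are made up for illustration; the pattern is your regex, unchanged apart from Java string escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // The regex from the question, escaped for a Java string literal.
    static final Pattern ORIGINAL =
            Pattern.compile("(<a\\s\\b(href|title)\\b.*\\\">)?|(<[\\/]a>)");

    // Hypothetical helper: returns the text of the first match, or null if find() fails.
    static String firstMatch(String input) {
        Matcher m = ORIGINAL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No tags at all, yet find() succeeds with an empty match at index 0.
        System.out.println("[" + firstMatch("no tags here") + "]");
        // Even for a lone closing tag, the empty first alternative wins at index 0.
        System.out.println("[" + firstMatch("</a>") + "]");
    }
}
```

Both calls print an empty pair of brackets: the optional group is allowed to match zero times, so the first alternative succeeds everywhere without consuming anything.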
Now look at this part:
<a\s\b(href|title)\b.*\">
That says:
1. '<'
2. 'a'
3. A whitespace character
4. A word boundary
5. A group consisting of "href" or "title"
6. A word boundary
7. Zero or more characters
8. '"'
9. '>'
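A minor problem with that is that item 4 is redundant: a whitespace character followed by the 'h' of "href" or the 't' of "title" always has a word boundary between them.
A larger problem is that you don't explicitly match the '=' and '"' that should follow the href or title attribute name.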
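The largest problem is in item 7. The '*' in '.*' is a greedy quantifier. It tries to match as much as possible. So in practice it will match all the way to the last '"' followed by '>' that it can reach: the last one on the line by default (since '.' does not match line terminators), or the last one in your document if DOTALL mode is enabled. That's wrong.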
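A small sketch of the greedy behaviour, assuming a one-line input that contains two anchors. The class name GreedyDemo, the helper firstMatch and the sample HTML are invented for illustration; the pattern is the opening-tag part of your regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // The opening-tag part of the regex from the question.
    static final Pattern OPEN_TAG =
            Pattern.compile("<a\\s\\b(href|title)\\b.*\\\">");

    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = OPEN_TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">One</a> and <a href=\"two.html\">Two</a>";
        // The greedy .* runs to the end of the line and backtracks only as far as
        // the LAST "> it can find, so one match swallows both opening tags.
        System.out.println(firstMatch(html));
        // prints: <a href="one.html">One</a> and <a href="two.html">
    }
}
```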
To fix the largest problem you need to use a reluctant quantifier, one that matches as few characters as it can get away with. For example:
.*?"
will (initially) stop matching at the first '"' that it sees.
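Here is a rough sketch of a reluctant version that also matches the '=' and '"' explicitly and captures the URL. It is not a robust HTML matcher: it assumes href is the first attribute, that the value is double-quoted, and that there is no whitespace around '='. The class name ReluctantDemo, the method hrefs and the trailing [^>]*> are my own additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // (.*?) is reluctant: it stops at the first closing quote that lets the
    // rest of the pattern match. [^>]*> then skips any remaining attributes.
    static final Pattern HREF =
            Pattern.compile("<a\\shref=\"(.*?)\"[^>]*>");

    // Hypothetical helper: collects the captured href value of every match.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">One</a> and <a href=\"two.html\">Two</a>";
        System.out.println(hrefs(html)); // prints: [one.html, two.html]
    }
}
```

The "(initially)" matters: a reluctant quantifier will still extend its match if the rest of the pattern cannot match at the first '"'. That is why the sketch follows the closing quote with [^>]*> rather than a bare '>'.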
Lessons:
It is a bad idea to use regexes to parse structured documents. HTML is particularly difficult, because:
- there is so much legal variability in the syntax of an HTML document
- many HTML documents you will find in the wild are malformed.
Instead, use a proper parser. For example, the Jsoup parser is a good option for parsing HTML documents that may be syntactically invalid. Instead of rejecting a document out of hand, it will try to (internally) correct the errors.
If you are going to "borrow" someone else's regexes, you are relying on their ability to write correct regexes, and your ability to understand if their regex is (really) applicable to your problem. (Did they do it correctly? Are the assumptions that they may have made valid in your use-case?)
If you are going to attempt to write your own regexes to parse complicated documents, you need to understand the (Java) regex language. There are some nasty traps; e.g. greedy quantification and catastrophic backtracking.
If you have to debug regexes, you need to treat this like any other code debugging problem:
- Make sure you understand the language (of regexes)
- Read your code (regexes) carefully.
- Explain your code (regexes) to your Rubber Duck. (Not a joke.)
- and so on.
If that sounds too hard, don't use regexes for complicated problems.