
Regex Pattern to Grab Data between a href Tag with limited characters ignoring numbers

I need a regex pattern to match any text that comes between:

<a href="https://website.com">Health & Beauty</a>

The text may or may not include a space and/or the special character "&", but it should not contain any numbers. It should also be between 4 and 10 characters long. In that case, I would want to extract:

Beauty & Fashion

I was advised to use the following pattern:

(?<=&|>)([^&\r\n]{4,10}(?=&|<\/a>))*

It worked great, but now the problem is how to make the pattern ignore any tag whose text contains a number, like:

<a href="#">January 2019</a> 
The fourth bird
  • Please add some additional clarification; your question contains some internal contradictions. The regex in question (1) has a typo (you need to escape the `/`), (2) would not match everything between the `>` and `<` in your example, and (3) certainly would not match `Beauty & Fashion` in `Health & Beauty`. – elixenide Sep 06 '19 at 19:39
  • `Beauty & Fashion` are more than 4-10 chars in total. The negated character class `[^&\r\n]` contains an ampersand meaning it would not match it. What exactly are you trying to match? – The fourth bird Sep 07 '19 at 11:16

2 Answers


First of all, you really shouldn't be parsing HTML with Regex, see https://stackoverflow.com/a/1732454/1687909
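
As a rough illustration of the parser-based route, here is a minimal Python sketch using the standard library's `html.parser`. The class name `AnchorTextParser` and the helper `anchor_texts_without_digits` are made up for this example, and it only filters on digits because the length requirement in the question is ambiguous:

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._buffer = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.texts.append("".join(self._buffer).strip())
            self._in_anchor = False


def anchor_texts_without_digits(html):
    """Return anchor texts that are non-empty and contain no digit."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return [t for t in parser.texts if t and not any(c.isdigit() for c in t)]
```

Any further rule, such as a length limit, could then be applied to the returned strings with ordinary string checks rather than inside a regex.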

That being said, if the regex you posted is working for you, you can add digits to the negated character class so that it no longer matches text containing numbers:

(?<=&|>)([^0-9&\r\n]{4,10}(?=&|<\/a>))*
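
To see how this behaves, here is a small Python sketch (Python is an assumption, since the question does not name a dialect; `PATTERN` and `extract` are illustrative names). Because the outer group is followed by `*`, the pattern also produces empty matches, so the sketch filters those out:

```python
import re

# The pattern above, unchanged. Both lookbehind alternatives are one
# character wide, which Python's fixed-width lookbehind accepts.
PATTERN = re.compile(r'(?<=&|>)([^0-9&\r\n]{4,10}(?=&|<\/a>))*')


def extract(html):
    """Return the non-empty matches, with surrounding spaces trimmed."""
    return [m.group(0).strip() for m in PATTERN.finditer(html) if m.group(0)]
```

With the question's first example this yields the pieces on either side of the `&` separately, and with the `January 2019` example it yields nothing.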
Mogzol

Using "The Best Regex Trick Ever", this pattern captures every anchor text in regex group 1, but fills regex group 2 only for text without numbers, so you can look at group 2 to pull out just the text you want:

(?<=<a href[^>]+?>)([^<]*?[0-9][^<]*|([^<]*?)(?=<))

At least, it will in .NET, which supports variable-width lookbehinds. It won't work in PCRE, which doesn't, or in JavaScript engines that predate the lookbehind support added in ES2018. You didn't say which regex dialect you're using.
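
For engines without variable-width lookbehind, the same trick can be sketched by matching the tag itself instead of looking behind for it. The following Python version is an adaptation, not the pattern above: `ANCHOR` and `texts_without_digits` are illustrative names, and it assumes the anchors contain plain text with no nested tags.

```python
import re

# First alternative: anchor text containing a digit (matched, not captured).
# Second alternative: anchor text without digits, captured in group 1.
ANCHOR = re.compile(r'<a\b[^>]*>(?:[^<]*[0-9][^<]*|([^<]+))</a>')


def texts_without_digits(html):
    """Return the text of every anchor whose text contains no digit."""
    return [
        m.group(1).strip()
        for m in ANCHOR.finditer(html)
        if m.group(1) is not None
    ]
```

The unwanted case is consumed by the first alternative, so only anchors that reach the second alternative leave anything in the capture group.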

TessellatingHeckler