I'm trying to build a regex that extracts links from text, but only the links that do not have rel="nofollow".

Example:

aiusdiua asudauih <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a>

Thanks!

2 Answers

The following regex will do the job:

<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"

The URL you want ends up in capture group 1. In Ruby, for example:

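# =~ returns the match position, or nil when no followed link is found
if input =~ /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
    match = $~[1]  # $~ is the last MatchData; [1] is the captured href value
end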

Because the negative lookahead allows [^>]*? before rel, it rejects the tag wherever rel="nofollow" sits inside it: href or any other attribute may come before rel, and a rel that comes before href is caught as well.
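The snippet above only returns the first match. To collect every followed link, the same pattern can be passed to String#scan. This is a minimal sketch: the method name followed_links and the sample HTML are illustrative assumptions, not part of the original answer.

```ruby
# Returns the href values of all <a> tags that lack rel="nofollow".
def followed_links(html)
  # Same pattern as above: the lookahead rejects the tag if rel="nofollow"
  # appears anywhere between "<a " and the closing ">".
  pattern = /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
  # With a single capture group, scan yields one-element arrays; flatten them.
  html.scan(pattern).flatten
end

# Hypothetical input: rel before href, rel after href, and no rel at all.
sample = '<a rel="nofollow" href="http://a.example/1">a</a> ' \
         '<a href="http://b.example/2" rel="nofollow">b</a> ' \
         '<a class="c" href="http://c.example/3">c</a>'

p followed_links(sample)
```

Only the third link should be reported, since both nofollow links are rejected regardless of attribute order.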

Staffan Nöteberg

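Try this, with case-insensitive matching enabled so that A and AREA also match lowercase tags:

<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"]([^>"']*)[^>]*?>

The negative lookahead comes right after the tag name and scans ahead with [^<>]*?, so it sees rel="nofollow" anywhere inside the tag.

If you are using .NET regex, you can name the group:

<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"](?<URL>[^>"']*)[^>]*?>

The URL is then in the group named URL, which is also group 1.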
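As a rough Ruby sketch of this approach (Ruby also supports named groups), the variant below places the negative lookahead right after the tag name so that it scans the whole tag, and excludes both quote characters from the captured value. The method name followed_urls, the group name url and the sample input are assumptions made for illustration.

```ruby
# Returns href values of <a> and <area> tags that lack rel="nofollow".
def followed_urls(html)
  # The i flag makes A|AREA match lowercase tags too. The lookahead
  # scans up to the closing ">" for rel="nofollow" before href is read.
  pattern = /<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"](?<url>[^>"']*)[^>]*?>/i
  urls = []
  # Inside the scan block, Regexp.last_match is the current MatchData,
  # so the named group can be read by name.
  html.scan(pattern) { urls << Regexp.last_match[:url] }
  urls
end

# Hypothetical input mixing quote styles and tag names.
sample = %q{<a href="http://x.example/" rel="nofollow">n</a> } +
         %q{<AREA shape="rect" href='http://y.example/m'> } +
         %q{<a href="http://z.example/">z</a>}

p followed_urls(sample)
```

Reading the value by name rather than by index keeps the calling code stable if more groups are added to the pattern later.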

Rajeev