Regex for extracting links with specified attributes

Question

I'm trying to build a regex to extract links from text that do not have rel="nofollow".

Example:

aiusdiua asudauih <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a>

Thanks!

... And is there any possibility that you can use a parser instead of regex? — jensgram, Apr 01 '11 at 08:15

Staffan Nöteberg · Accepted Answer · 2011-04-01T09:52:33.743

2

The following regex will do the job:

<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"

The wanted URLs will be in capture group 1. For example, in Ruby:

if input =~ /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
    match = $~[1]
end

Since the negative lookahead accepts [^>]*? before rel, href or anything else can come before rel and the tag is still rejected. If href comes after rel, that is of course handled as well.
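The snippet above only returns the first match. As a minimal sketch of collecting every matching link, and of checking that attribute order does not matter, the same pattern can be passed to String#scan. The constant name and the sample HTML below are made up for illustration and assume well-formed, double-quoted attributes:

```ruby
# Pattern from the answer: the negative lookahead rejects any <a ...> tag
# whose attribute list contains rel="nofollow", wherever it appears.
LINK_WITHOUT_NOFOLLOW = /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/

# Hypothetical sample input (not from the question).
html = <<~HTML
  <a rel="nofollow" href="http://example.com/skip-1">a</a>
  <a href="http://example.com/skip-2" rel="nofollow">b</a>
  <a href="http://example.com/keep-1">c</a>
  <a class="x" href="http://example.com/keep-2">d</a>
HTML

# String#scan returns one array of capture groups per match;
# flatten turns that into a plain list of URLs.
urls = html.scan(LINK_WITHOUT_NOFOLLOW).flatten
puts urls
```

With this input, only the two "keep" URLs should be printed, since both nofollow links are rejected whether rel comes before or after href.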

edited Apr 01 '11 at 09:52

answered Apr 01 '11 at 08:31

Staffan Nöteberg

In my experience with regex: always take care when using .*? – Rajeev Apr 01 '11 at 08:35
@regexhacks: why should one take care when using `.*?`? – Mauritz Hansen Apr 01 '11 at 08:58
@regexhacks I agree. One must be careful with all quantifiers that accept nothing or an unlimited number of characters. – Staffan Nöteberg Apr 01 '11 at 09:07
@regexhacks Since it accepts `[^>]*?` before `rel` in the *negative lookahead*, `href` or anything else can come before `rel`. If `href` comes after `rel`, that is of course also OK. – Staffan Nöteberg Apr 01 '11 at 09:13
@regexhacks Thanks for asking for details, it was probably needed. – Staffan Nöteberg Apr 01 '11 at 09:51

Rajeev · Answer 2 · 2011-04-01T08:45:55.463

0

Try this:

<(?:A|AREA)\b[^<>]*?(?!rel="nofollow")[^<>]*?href=['"]([^>"]*)[^>]*?>

If you are using .NET regex, then:

<(?:A|AREA)\b[^<>]*?(?!rel="nofollow")[^<>]*?href=['"](?<URL>[^>"]*)[^>]*?>

The data is in the group named URL, or in group 1.

edited Apr 01 '11 at 08:45

answered Apr 01 '11 at 08:26

Rajeev

I think you will have to fix two issues in this answer: 1) Right now it will find strings that actually have `rel="nofollow"`, but the question asked for the opposite. 2) It won't match if `href` comes before `rel` in an `a` tag. – Staffan Nöteberg Apr 01 '11 at 08:39
It will still match ``, won't it? I think you need one more edit session for that regex :-) – Staffan Nöteberg Apr 01 '11 at 09:04
Yep! It should. Yours is better than mine! :-) I won't edit; I'm supporting your answer. – Rajeev Apr 01 '11 at 09:10
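The first point raised in these comments can be checked with a small sketch. It assumes the /i flag is added to the second pattern (it spells the tag names in upper case), and the constant names and sample link are hypothetical. The lookahead in that pattern is tested at a single position, and the lazy [^<>]*? after it is free to skip over rel="nofollow", so the tag appears not to be excluded:

```ruby
# Assumption: /i is added, because the pattern writes A and AREA in upper case.
SECOND_PATTERN = /<(?:A|AREA)\b[^<>]*?(?!rel="nofollow")[^<>]*?href=['"]([^>"]*)[^>]*?>/i
# Pattern from the accepted answer, for comparison.
ACCEPTED_PATTERN = /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/

# Hypothetical nofollow link that a correct pattern should reject.
nofollow_link = '<a rel="nofollow" href="http://example.com/skip">x</a>'

# String#[] with a regex and a group number returns that capture, or nil.
second_hit = nofollow_link[SECOND_PATTERN, 1]
accepted_hit = nofollow_link[ACCEPTED_PATTERN, 1]

puts "second pattern:   #{second_hit.inspect}"
puts "accepted pattern: #{accepted_hit.inspect}"
```

The difference is that the accepted pattern scans ahead inside the lookahead itself ([^>]*? before rel), so it inspects the whole attribute list rather than one position.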