
I am trying to use a regexp to find the link that appears just before the string textABCXYZ123 in the HTML below.

lorem ispum...<strong><a href="http://www.site.com/link/123">FIRSTlink</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>ANOTHERLINK</span>.
... more text........... more text........
... more text.......<strong><a href="http://www.site.com/link/123">other link</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>ANOTHERLINK</span>.
... more text........... more text........
<strong><a href="http://www.IneedThis.com/link/123">somewhere to go</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>textABCXYZ123</span>
...
... more text..........<strong><a href="http://www.site.com/link/123">other link</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>ANOTHERLINK</span>.
... more text........... more text........

There are many links, and I need to capture the one that appears just before the textABCXYZ123 string. I tried the regex below, but it returns the first link instead of the one nearest to the string:

$find_string = 'ABCXYZ123';
preg_match('#href="(.*)".*text'.$find_string.'#sU',$html,$match);
// so the final result is "http://www.site.com/link/123", which is the first link

Can someone show me how to capture the link just before my string textABCXYZ123? P.S. I know about XPath and Simple HTML DOM, but I would like to do this with a regexp. Thanks for any input.

user969068
  • You may want to have a look at this http://stackoverflow.com/questions/13618077/php-regex-to-match-the-last-occurrence-of-a-string for finding the last occurrence. – Braunson Jan 08 '14 at 17:02

2 Answers

2

You could maybe try the regex:

href="([^"]*)">(?=(?:(?!href).)*textABCXYZ123)

Like so?

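$find_string = 'ABCXYZ123';
// The s flag lets . match newlines; preg_quote() escapes any regex metacharacters in $find_string.
preg_match('~href="([^"]*)">(?=(?:(?!href).)*text'.preg_quote($find_string, '~').')~s', $html, $match);
// On a match, $match[1] holds the URL.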
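To see the pattern at work, here is a minimal sketch run against a shortened copy of the question's sample HTML. The helper name linkBefore() and the /link/456 URL are made up for illustration; the pattern itself is the one from this answer, with the search string passed through preg_quote().

```php
<?php
// Shortened copy of the question's sample HTML (three link blocks).
$html = <<<'HTML'
lorem ipsum...<strong><a href="http://www.site.com/link/123">FIRSTlink</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>ANOTHERLINK</span>.
<strong><a href="http://www.IneedThis.com/link/123">somewhere to go</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>textABCXYZ123</span>
... more text..........<strong><a href="http://www.site.com/link/456">other link</a> </strong><br>
1 points| Saved Jan 08, 2014 at 00:49 <span class=notes_box>ANOTHERLINK</span>.
HTML;

// Hypothetical helper: returns the href nearest before "text<needle>", or null if none is found.
function linkBefore(string $html, string $needle): ?string
{
    // s: let . match newlines, since the marker text sits on the line after the link.
    $pattern = '~href="([^"]*)">(?=(?:(?!href).)*text' . preg_quote($needle, '~') . ')~s';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

echo linkBefore($html, 'ABCXYZ123'), "\n"; // prints http://www.IneedThis.com/link/123
```

Returning null when nothing matches keeps the caller from reading an undefined $match[1].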


The first part is href="([^"]*)"> and shouldn't be too hard to understand. It matches href=", then any number of non-quote characters (captured as group 1), followed by the closing quote and >.

(?=(?:(?!href).)*textABCXYZ123) is a positive lookahead. (A positive lookahead has the format (?= ... ).) It checks that the text after the current position matches what is inside it, without consuming any characters; if that check fails, there is no match at this position.

For instance, a(?=.*b) matches any a that is followed by any number of characters and then a b; in other words, it matches a as long as there is a b somewhere after it.
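The a(?=.*b) example can be checked with a small sketch. The test string 'a1 a2 b a3' is made up for illustration; the offsets show which occurrences of a pass the lookahead.

```php
<?php
// a(?=.*b): match an "a" only if a "b" occurs somewhere later on the same line.
preg_match_all('/a(?=.*b)/', 'a1 a2 b a3', $m, PREG_OFFSET_CAPTURE);

// Each entry of $m[0] is [matched text, byte offset]; keep only the offsets.
$offsets = array_column($m[0], 1);
print_r($offsets); // offsets 0 and 3: the last "a" (offset 8) has no "b" after it
```

Each matched text is just "a", because the lookahead only inspects what follows and is not part of the match.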

So, href="([^"]*)"> will match only if (?:(?!href).)*textABCXYZ123 matches immediately after it.

(?:(?!href).)* is a modified .*: before each character is consumed, the negative lookahead (format (?! ... )) makes sure that href does not start at that position, so the repetition can never run across another href. A negative lookahead is the opposite of a positive lookahead:

a(?!.*b) matches any a as long as there is no b anywhere after it.
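The difference between a plain .* and (?:(?!href).)* inside the lookahead can be seen in a short sketch. The one-line $html string and the test string 'a1 a2 b a3' are made up for illustration.

```php
<?php
// a(?!.*b): match an "a" only if no "b" follows it.
preg_match_all('/a(?!.*b)/', 'a1 a2 b a3', $neg, PREG_OFFSET_CAPTURE);
$negOffsets = array_column($neg[0], 1); // only the last "a", at offset 8

$html = '<a href="first">A</a> x <a href="wanted">B</a> textABCXYZ123 <a href="later">C</a>';

// Plain .* may run across other links, so the very first href already qualifies.
preg_match('~href="([^"]*)">(?=.*textABCXYZ123)~s', $html, $plain);

// (?:(?!href).)* stops before the next "href", so only the nearest link qualifies.
preg_match('~href="([^"]*)">(?=(?:(?!href).)*textABCXYZ123)~s', $html, $tempered);

echo $plain[1], "\n";    // first
echo $tempered[1], "\n"; // wanted
```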

Jerry
  • Thank you very much, exactly how I wanted, Could you please explain a bit your pattern. I am very novice to regex and it will be a big help to learn. Thanks again. – user969068 Jan 08 '14 at 17:05
  • @user969068 Added some more explanation. Hopefully, that's not too difficult to understand :) – Jerry Jan 08 '14 at 17:10
  • Thanks a lot for your efforts. Much useful. Cant thank enough...Best Regards – user969068 Jan 08 '14 at 17:15
1
(?s)href=[^<]+</a>(?!.*(href).*(textABCXYZ123))(?=.*(textABCXYZ123))

You could also try this; let me know if you want an explanation.
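As a rough sketch of how this second pattern behaves in PHP: as written it has no capture group for the URL, so the whole match runs from href= to </a>. The variant with a capture group below is an adaptation for illustration, not part of the answer, and the one-line $html string is made up.

```php
<?php
$html = '<a href="first">A</a> x <a href="wanted">B</a> textABCXYZ123 <a href="later">C</a>';

// Pattern as posted: (?!...) rejects a link if another href still precedes the marker,
// (?=...) requires the marker to appear somewhere after the link.
preg_match('~(?s)href=[^<]+</a>(?!.*(href).*(textABCXYZ123))(?=.*(textABCXYZ123))~', $html, $whole);

// Adapted variant (an assumption, not from the answer): capture just the URL in group 1.
preg_match('~(?s)href="([^"]*)"[^<]*</a>(?!.*href.*textABCXYZ123)(?=.*textABCXYZ123)~', $html, $url);

echo $whole[0], "\n"; // href="wanted">B</a>
echo $url[1], "\n";   // wanted
```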

Srb1313711