i have a html document with n "a href" tags with different target urls and different text between the tag.
For example:
<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>
As you can see the target urls switch between "d?, d., d/d?, d/d." and between the "a tag" there could be any type of html which is allowed by w3c.
I need a Regex which gives me all links which has one of these combination in the target url: "d?, d., d/d?, d/d." and has "Lorem" or "test" between the "a tags" in any position including sub html tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as followed:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this will only works if I put a ".*?" before and after the (lorem|test) and this would be to greedy.
If there is a easier way with SimpleXml or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!