1

I have an HTML document with n "a href" tags that have different target URLs and different content between the tags.

For example:

<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>

As you can see, the target URLs vary between "d?", "d.", "d/d?" and "d/d.", and the content between the opening and closing "a" tags can be any HTML that the W3C allows.

I need a regex that returns all links whose target URL contains one of these combinations ("d?", "d.", "d/d?", "d/d.") and that have "Lorem" or "test" anywhere between the "a" tags, including inside nested HTML tags.

My Regex so far:

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)

I tried to include the lorem / test part as follows:

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)

but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.

If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise, I would appreciate any help with the regex.

Thanks!

Talisin

4 Answers

2

Here you go:

$html = array
(
    '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
);

$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');

foreach ($anchors as $anchor)
{
    if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
    {
        $result[] = strval($anchor['href']);
    }
}

echo '<pre>';
print_r($result);
echo '</pre>';

Output:

Array
(
    [0] => http://www.example.com/d?12345abc
    [1] => http://www.example.com/d/d.1234
)

The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:

function phXML($xml, $xpath = null)
{
    if (extension_loaded('libxml') === true)
    {
        libxml_use_internal_errors(true);

        if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
        {
            if (is_string($xml) === true)
            {
                $dom = new DOMDocument();

                if (@$dom->loadHTML($xml) === true)
                {
                    return phXML(@simplexml_import_dom($dom), $xpath);
                }
            }

            else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
            {
                if (isset($xpath) === true)
                {
                    $xml = $xml->xpath($xpath);
                }

                return $xml;
            }
        }
    }

    return false;
}

I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
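For readers who would rather drop the wrapper, here is a minimal sketch of the same idea using DOMDocument and DOMXPath directly. The function name findLinks() and its $words parameter are hypothetical and not part of the answer above; the sketch also assumes the search words contain no double quotes, since they are pasted into the XPath string as-is.

```php
<?php
// Hypothetical helper (not from the answer): returns the href of every anchor
// whose text content contains one of $words and whose URL contains "d?" or "d.".
function findLinks(string $html, array $words = ['lorem', 'test']): array
{
    // Collect libxml warnings about sloppy HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return [];
    }

    // Build the predicate: contains(., "lorem") or contains(., "test")
    // Assumption: the words themselves contain no double quotes.
    $conditions = [];
    foreach ($words as $word) {
        $conditions[] = sprintf('contains(., "%s")', $word);
    }
    $query = '//a[@href][' . implode(' or ', $conditions) . ']';

    $result = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query($query) as $anchor) {
        $href = $anchor->getAttribute('href');
        // Same URL check as in the answer: "d" followed by "." or "?".
        if (preg_match('~d[.?]~', $href) === 1) {
            $result[] = $href;
        }
    }

    return $result;
}
```

Note that XPath 1.0's contains() is case-sensitive, so this sketch (like the answer's query) finds "lorem" but not "Lorem".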

Alix Axel
  • Thanks! Using DOMDocument() seems to be the best way and I'm familiar with Xpath so that isn't a problem at all. As someone said here: "HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror" – Talisin Jul 18 '11 at 01:35
  • @Talisin: Omg, who said that? (I just hope you don't know that by heart!) xD – Alix Axel Jul 18 '11 at 01:36
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - :D – Talisin Jul 18 '11 at 01:42
1

Here is a Regular Expression which works:

$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);

The only catch is that it relies on there being a newline character between each <a> tag, because "." does not match a newline without the s modifier. Otherwise it will match something like:

<a href="http://www.example.com/d.1234" name="example3">example3</a><a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
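One possible way to remove the dependency on newlines, sketched here as a variation rather than as the pattern above: replace each ".*?" with a "tempered" dot that refuses to step over a closing </a>, and replace the ".*?>" after the URL with "[^>]*>". The function name matchAnchors() is hypothetical, and the sketch still assumes reasonably regular markup (for example, no ">" inside attribute values).

```php
<?php
// Hypothetical helper: returns the hrefs of anchors that match the URL rule
// and contain "lorem" or "test", without ever crossing an </a> boundary.
function matchAnchors(string $html): array
{
    $pattern =
        // Opening tag and href, captured in group 1; URL part as in the regex above.
        '~<a\s[^>]*href=["\']((?:http://)?[a-z0-9-]+(?:\.[a-z0-9-]+)*/(?:d/)?d[?.][^"\'>]*)["\'][^>]*>'
        // Body: any character, as long as it does not start a closing </a>.
        . '(?:(?!</a>).)*?(?:lorem|test)(?:(?!</a>).)*?</a>~is';

    preg_match_all($pattern, $html, $matches);

    return $matches[1];
}

// Both anchors on one line: only the second one should be reported.
$line = '<a href="http://www.example.com/d.1234" name="example3">example3</a>'
      . '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>';
print_r(matchAnchors($line));
```

The s modifier is safe here because the negative lookahead, not the absence of newlines, is what stops the match at the end of each anchor.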
Paul
  • Thank you for the regex. It helps me a lot to understand regex more and more. But as you said it relies on the new line and there are some room for errors, so I'll use the DOMDocument() instead. But thanks, I really appreciate it. – Talisin Jul 18 '11 at 01:37
  • No problemo :) DOMDocument is always better when dealing with HTML anyways :) – Paul Jul 18 '11 at 01:42
0

Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.

There's a good list of them here: Robust and Mature HTML Parser for PHP

fletom
0

This will print only the first and fourth links, because only those two meet both conditions.

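// Group 1: href value, group 2: remaining attributes, group 3: anchor content.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]); // only the href and the content are needed

for ($i = 0; $i < $count; $i++) {
    if (
        // URL must contain "/d." or "/d?" (a bare '/d' would also match e.g. "/docs")
        preg_match('#/d[.?]#', $matches[1][$i]) === 1
        &&
        // content must contain "lorem" or "test", case-insensitively
        preg_match('#(lorem|test)#is', $matches[3][$i]) === 1
    ) {
        echo $matches[1][$i], "\n";
    }
}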
Dejan Marjanović