(PHP) Regex for finding specific href tag

Question

I have an HTML document with n "a href" tags, each with a different target URL and different content between the tags.

For example:

<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>

As you can see, the target URLs vary between "d?", "d.", "d/d?" and "d/d.", and between the opening and closing "a" tags there can be any kind of HTML that the W3C allows.

I need a regex that gives me all links that have one of these combinations in the target URL ("d?", "d.", "d/d?", "d/d.") and have "Lorem" or "test" between the "a" tags in any position, including inside nested HTML tags.

My Regex so far:

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)

I tried to include the lorem / test as follows:

href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)

but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.

If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.

Thanks!

score 2 · Accepted Answer · answered Jul 18 '11 at 01:18

Here you go:

$html = array
(
    '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
);

$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');

foreach ($anchors as $anchor)
{
    if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
    {
        $result[] = strval($anchor['href']);
    }
}

echo '<pre>';
print_r($result);
echo '</pre>';

Output:

Array
(
    [0] => http://www.example.com/d?12345abc
    [1] => http://www.example.com/d/d.1234
)

The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:

function phXML($xml, $xpath = null)
{
    if (extension_loaded('libxml') === true)
    {
        libxml_use_internal_errors(true);

        if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
        {
            if (is_string($xml) === true)
            {
                $dom = new DOMDocument();

                if (@$dom->loadHTML($xml) === true)
                {
                    return phXML(@simplexml_import_dom($dom), $xpath);
                }
            }

            else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
            {
                if (isset($xpath) === true)
                {
                    $xml = $xml->xpath($xpath);
                }

                return $xml;
            }
        }
    }

    return false;
}

I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
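
For anyone who does want to drop the wrapper, here is a minimal sketch using DOMDocument and DOMXPath directly. The function name findMatchingHrefs() is made up for this example. Two things differ from the code above and are my assumptions, not part of the answer: the text test is made case-insensitive with translate() (the question writes "Lorem" while the sample contains "lorem", and XPath 1.0 contains() is case-sensitive), and the href check requires a slash before the "d". The sample markup is a cleaned-up version of the question's, plus one extra link I added to show the URL filter rejecting something.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not from the answer).
// Returns the href of every <a> whose text contains one of $words
// (ASCII case-insensitive) and whose URL contains "/d." or "/d?".
function findMatchingHrefs(string $html, array $words): array
{
    if ($words === []) {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    if ($dom->loadHTML($html) === false) {
        return [];
    }
    libxml_clear_errors();

    // XPath 1.0 has no lower-case(); translate() folds ASCII letters only.
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';

    $tests = [];
    foreach ($words as $word) {
        // Assumption: the words contain no double quotes.
        $tests[] = sprintf(
            'contains(translate(., "%s", "%s"), "%s")',
            $upper,
            $lower,
            strtolower($word)
        );
    }
    $query = '//a[@href][' . implode(' or ', $tests) . ']';

    $xpath  = new DOMXPath($dom);
    $result = array();

    foreach ($xpath->query($query) as $anchor) {
        $href = $anchor->getAttribute('href');

        // Same idea as the preg_match() above, anchored on the slash.
        if (preg_match('~/d[.?]~', $href) === 1) {
            $result[] = $href;
        }
    }

    return $result;
}

$html = implode("\n", array(
    '<a href="http://www.example.com/d?12345abc" name="example"><span>Lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img src="x.png" alt="">test</a>',
    '<a href="http://www.example.com/without_d/1234" name="example5">without a d as target url</a>',
    // Added for illustration: matching text, non-matching URL.
    '<a href="http://www.example.com/without_d/5678" name="example6">test</a>',
));

$found = findMatchingHrefs($html, array('lorem', 'test'));
print_r($found);
```

With this sample it should return the same two URLs as the output above (the d?12345abc and d/d.1234 links).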

Thanks! Using DOMDocument() seems to be the best way and I'm familiar with Xpath so that isn't a problem at all. As someone said here: "HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror" — Talisin, Jul 18 '11 at 01:35
@Talisin: Omg, who said that? (I just hope you don't know that by heart!) xD — Alix Axel, Jul 18 '11 at 01:36
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - :D — Talisin, Jul 18 '11 at 01:42

score 1 · Answer 2 · answered Jul 18 '11 at 01:21

Here is a Regular Expression which works:

$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);

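The only catch is that it relies on there being a newline character between each <a> tag (without the s modifier, "." does not match newlines, so a match cannot run from one line into the next). Otherwise it will match something like: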

<a href="http://www.example.com/d.1234" name="example3">example3</a><a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
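
If you still want to stay with a regex, one way to reduce that dependency is to stop the wildcards from running past a closing tag. The following is only a sketch of that idea, not the pattern above: it uses [^>]* inside the opening tag and a negative lookahead (?!</a>) before every character of the link body, so a match cannot span two links even when they sit on one line. The looser host part and the capture group around the URL are my own choices. It is still brittle: for example, "test" inside an attribute value or inside a longer word such as "latest" would also count.

```php
<?php
// Sketch under assumptions: double- or single-quoted href, no ">" inside attribute values.
$pattern = '~<a\s[^>]*href=["\'](https?://[a-z0-9.-]+/(?:d/)?d[?.][^"\']*)["\'][^>]*>'
         . '(?:(?!</a>).)*?(?:lorem|test)(?:(?!</a>).)*</a>~is';

// Three of the question's links joined on ONE line, with no newline between them.
$html = '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>'
      . '<a href="http://www.example.com/d.1234" name="example3">example3</a>'
      . '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>';

$matches = array();
preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured URLs of the links that matched.
print_r($matches[1]);
```

Here the "example3" link is skipped on its own instead of being merged with the link after it, so only the first and last URLs are captured.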

Paul


Thank you for the regex. It helps me a lot to understand regex more and more. But as you said it relies on the new line and there is some room for errors, so I'll use the DOMDocument() instead. But thanks, I really appreciate it. – Talisin Jul 18 '11 at 01:37
No problemo :) DOMDocument is always better when dealing with HTML anyways :) – Paul Jul 18 '11 at 01:42

score 0 · Answer 3 · edited May 23 '17 at 11:55

Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.

There's a good list of parsers in this question: Robust and Mature HTML Parser for PHP

answered Jul 18 '11 at 01:02

fletom


score 0 · Answer 4 · answered Jul 18 '11 at 01:19

This prints only the first and fourth links, because they are the only ones that meet both conditions.

preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);

for($i = 0; $i < $count; $i++){

    if(
        strpos($matches[1][$i], '/d') !== false
        &&
        preg_match('#(lorem|test)#is', $matches[3][$i]) === 1
    )
    {
        echo $matches[1][$i], "\n";
    }

}
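
Note that strpos(..., '/d') accepts any URL containing "/d", which is broader than the four combinations in the question. A stricter check could look like the sketch below. The helper name hasDPrefix() and the /data/ URL are made up for illustration, and the pattern assumes the "d" part comes right after the host, as in the question's samples.

```php
<?php
// Hypothetical helper: true only for "<host>/d?", "<host>/d.", "<host>/d/d?" and "<host>/d/d.".
function hasDPrefix(string $url): bool
{
    // scheme://host/ then an optional "d/" then "d" followed by "?" or "."
    return preg_match('~^https?://[^/?#]+/(?:d/)?d[.?]~i', $url) === 1;
}

$urls = array(
    'http://www.example.com/d?12345abc',
    'http://www.example.com/d/d?abc1234',
    'http://www.example.com/d.1234',
    'http://www.example.com/d/d.1234',
    'http://www.example.com/without_d/1234',
    'http://www.example.com/data/1234', // made-up URL: contains "/d" but is not a target
);

foreach ($urls as $url) {
    printf("%-42s %s\n", $url, hasDPrefix($url) ? 'match' : 'no match');
}
```

In the loop above, hasDPrefix($matches[1][$i]) could replace the strpos() test if the broader match turns out to be a problem.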
