Why isn't array_unique returning me a list of unique items?

Question

I am trying to scrape all the urls on the home page on my client's site so I can migrate it to wordpress. The problem is I can't seem to arrive at a de-duplicated list of urls.

Here's the code:

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
   $href = $hrefs->item($i);
   $url = $href->getAttribute('href');

   if($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])){
    $urls = $matches[0][0][0];
    $list = implode( ', ', array_unique( explode(", ", $urls) ) );
    echo $list . '<br/>';
    //print_r($list);
   }
}

(Also posted here.)

Instead I am getting duplicates like this:

http://www.catwalkyourself.com/rss.php
http://www.catwalkyourself.com/rss.php

How do I fix this?

No problem. If the code is relatively short like this, it's just generally easier if people don't have to follow an extra link to see it. — Wiseguy, Jun 22 '12 at 18:12
I just couldn't get it to format correctly for some reason - so I made a gist — Amit Erandole, Jun 22 '12 at 18:13

score 3 · Answer 1 · answered Jun 22 '12 at 18:15

The way the code is structured with the loop right now, you are always calling array_unique with an array size of 1.

You need to build a list of URLs and then call array_unique. Try this:

<?php

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls  = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url  = $href->getAttribute('href');

    if( ($count = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])) > 0) {
        $urls[] = $matches[0][0][0]; // build list of URLs in the loop
    }
}

$list = implode( ', ', array_unique( $urls ) );
echo $list . '<br/>';

score 1 · Accepted Answer · answered Jun 22 '12 at 18:17

The last part of your code shouldn't be in the loop. You're traversing an array containing every links on the page. As each element of this array contains only one link, you're applying array_unique on an array which can't contain more than one element.

Try something like this:

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    if($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])){
        $urls[] = $matches[0][0][0];
    }
}
$list = implode(', ', array_unique($urls));
echo $list . '<br/>';

Why isn't array_unique returning me a list of unique items?

2 Answers2