
I need to identify unique URLs from an array.

All of the following variants should count as equal:

http://google.com
https://google.com
http://www.google.com
https://www.google.com
www.google.com
google.com

I have the following solution:

public static function array_unique_url(array $array) : array
{
    $uniqueArray = [];
    foreach($array as $item) {
        if(!self::in_array_url($item, $uniqueArray)){
            $uniqueArray[] = $item;
        }
    }
    return $uniqueArray;
}

public static function in_array_url(string $needle, array $haystack): bool {
    $haystack = array_map([self::class, 'normalizeUrl'], $haystack);
    $needle = self::normalizeUrl($needle);

    return in_array($needle, $haystack);
}

public static function normalizeUrl(string $url): string
{
    $url = strtolower($url);
    return preg_replace('#^(https?://)?(www\.)?#', '', $url);
}

However, this is not very efficient: O(n^2). Can anybody point me to a better solution?

Sahil Gulati
Chris

6 Answers


in_array is expensive: it scans the whole array on every call. Instead, build a hash map and store the normalized values as keys. Something like:

$myHash = []; // a map to hold normalized values and their counts

And while checking, do this:

if (!empty($myHash[$needle])) {
    // already exists
}
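To make the idea above concrete, here is a minimal standalone sketch (the function and variable names are illustrative, not from the answer) that reuses the question's normalization but keys a map by the normalized URL, so each duplicate check is an O(1) isset rather than a linear in_array scan:

```php
<?php
// Sketch of the hash-set approach: key by normalized URL, keep the
// first original spelling that was seen.
function array_unique_url(array $urls): array
{
    $seen = []; // normalized URL => first original URL seen
    foreach ($urls as $url) {
        $key = preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
        if (!isset($seen[$key])) { // O(1) lookup instead of in_array
            $seen[$key] = $url;
        }
    }
    return array_values($seen);
}

// Keeps 'http://google.com' (first variant seen) and 'example.org'.
print_r(array_unique_url([
    'http://google.com',
    'https://www.google.com',
    'example.org',
]));
```

This turns the overall cost from O(n^2) into O(n) expected time, since PHP array key lookups are hash-based.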
informer

I haven't tested it, but something like this should work:

function getUniqueUrls(array $urls)
{
    $unique_urls = [];
    foreach ($urls as $url) {
        $normalized_url = preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
        $unique_urls[$normalized_url] = true;
    }

    return array_keys($unique_urls);
}

$arr = [
    'http://google.com',
    'https://google.com',
    'http://www.google.com',
    'https://www.google.com',
    'www.google.com',
    'google.com'
];

$unique_urls = getUniqueUrls($arr);
arbogastes

Here is a simplified version. It avoids preg_replace, which is comparatively expensive, and performs no unnecessary string operations.

$urls = array(
    "http://google.com",
    "https://google.com",
    "http://www.google.com",
    "https://www.google.com",
    "www.google.com",
    "google.com"
);

$uniqueUrls = array();

foreach($urls as $url) {
    $subPos = 0;
    if(($pos = stripos($url, "://")) !== false) {
        $subPos = $pos + 3;
    }
    // Only strip "www." when it sits directly after the scheme;
    // searching for it anywhere would mangle URLs like "example.com/www.foo".
    if(strncasecmp(substr($url, $subPos), "www.", 4) === 0) {
        $subPos += 4;
    }
    $subStr = strtolower(substr($url, $subPos));
    if(!in_array($subStr, $uniqueUrls)) {
        $uniqueUrls[] = $subStr;
    }
}

var_dump($uniqueUrls);

Another performance optimization could be implementing binary search on the unique URLs, because in_array scans the whole array, which is not sorted.
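The binary-search idea could be sketched like this (illustrative code, not part of the original answer): keep $uniqueUrls sorted and locate each URL's position in O(log n):

```php
<?php
// Binary search for the position where $value is, or should be inserted,
// in an already-sorted array.
function sortedIndex(array $sorted, string $value): int
{
    $lo = 0;
    $hi = count($sorted);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($sorted[$mid] < $value) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }
    return $lo;
}

$uniqueUrls = [];
foreach (['google.com', 'example.org', 'google.com'] as $url) {
    $i = sortedIndex($uniqueUrls, $url);
    if (!isset($uniqueUrls[$i]) || $uniqueUrls[$i] !== $url) {
        array_splice($uniqueUrls, $i, 0, [$url]); // insert, keeping order
    }
}
print_r($uniqueUrls); // ['example.org', 'google.com']
```

Note that array_splice still shifts elements in O(n), so a hash map keyed by the normalized URL remains the simpler and faster option in PHP.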

katona.abel
<?php 

$urls = [
    'http://google.com',
    'https://google.com',
    'http://www.google.com',
    'https://www.google.com',
    'www.google.com',
    'google.com',
    'testing.com:9200'
];

$uniqueUrls = [];

foreach ($urls as $url) {
    $urlData = parse_url($url);
    $urlHostName = array_key_exists('host',$urlData) ? $urlData['host'] : $urlData['path'];
    $host = preg_replace('#^www\.#', '', $urlHostName); // strip only a leading www.
    if(!in_array($host, $uniqueUrls) && $host != ''){
        array_push($uniqueUrls, $host);
    }
}
print_r($uniqueUrls);

?>
Mihir Bhende

Why do you normalize your result array every time?

Here is a better solution based on your code. Keying the array by the normalized URL makes the duplicate check an O(1) isset instead of a linear search:

public static function array_unique_url(array $array): array
{
    $uniqueArray = [];
    foreach ($array as $item) {
        $normalized = self::normalizeUrl($item);
        if (!isset($uniqueArray[$normalized])) {
            $uniqueArray[$normalized] = $item;
        }
    }

    return $uniqueArray;
}

public static function normalizeUrl(string $url): string
{
    return preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
}

When you want your original items you can use array_values(array_unique_url($array)).

For the normalized URLs, use array_keys instead.

Sysix

Try this simple solution. It combines two functions, preg_replace and parse_url, to achieve the desired output.

<?php

$urls = array(
    "http://google.com",
    "https://google.com",
    "http://www.google.com",
    "https://www.google.com",
    "www.google.com",
    "google.com"
);

$uniqueUrls = array();
foreach($urls as $url)
{
    // Prepend http:// where the scheme is missing so parse_url can see the host.
    $changedUrl = preg_replace("/^(https?:\/\/)?/", "http://", $url);
    // Extract the host, then strip a leading www.
    $domain = preg_replace("/^(www\.)?/", "", parse_url($changedUrl, PHP_URL_HOST));
    // Key by the full domain, so google.com and google.net remain distinct.
    $uniqueUrls[$domain] = $domain;
}
print_r(array_values($uniqueUrls));
Sahil Gulati