For optimisation purposes, I need to intersect two arrays so that a value duplicated in both initial arrays is kept in the resulting array only as many times as it occurs in the array that holds fewer copies of it.

The order of values in the resulting array is not important.

Another important constraint is time complexity as this will be executed in a big loop.

Why array_intersect doesn't work :

From Shawn Pyle in the PHP docs :

array_intersect handles duplicate items in arrays differently. If there are duplicates in the first array, all matching duplicates will be returned. If there are duplicates in any of the subsequent arrays they will not be returned.
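
To make the quoted behaviour concrete, here is a small sketch using the inputs of the first example in this question. The variable name $builtin is only illustrative.

```php
<?php
// Inputs from the first example in the question.
$a = [1, 1, 2, 3, 4, 4, 5];
$b = [1, 3, 3, 5, 5];

// array_intersect() keeps every element of $a whose value occurs in $b,
// preserving the keys of $a; array_values() reindexes the result from 0.
$builtin = array_values(array_intersect($a, $b));

// Holds 1, 1, 3, 5: the value 1 is kept twice although $b contains it only once.
print_r($builtin);
```

The wanted result for these inputs is [1, 3, 5], which is why array_intersect alone is not enough here.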

Rules :

  • Returns values of $arr1 that are in $arr2
  • If a value is duplicated in $arr1 or $arr2, return it as many times as it occurs in whichever array holds fewer copies of it

Examples :

  • intersect([1, 1, 2, 3, 4, 4, 5], [1, 3, 3, 5, 5]) returns [1, 3, 5]
  • intersect([1, 1, 2, 3, 4, 4, 5], [1, 1, 1, 3, 3, 5, 5]) returns [1, 1, 3, 5]
  • intersect([1, 1, 2, 3, 4, 4, 5, 5], [1, 3, 3, 5, 5]) returns [1, 3, 5, 5]
  • intersect([1, 1, 1], [1, 1, 1]) returns [1, 1, 1]
  • intersect([1, 2, 3], [1, 3, 2]) returns [1, 2, 3]
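
Read together, the rules and examples above describe a multiset intersection. One possible way to sketch that, not taken from any answer on this page, is to count the values of the second array once and then walk the first array, consuming one count per match. The name intersectCounted is hypothetical, and the sketch assumes all values are integers or strings, because array_count_values() only counts those types.

```php
<?php
// Hypothetical sketch: multiset intersection using a count table.
// Assumes every value is an int or a string (array_count_values() limitation).
function intersectCounted(array $a, array $b): array
{
    // How many copies of each value of $b are still available to match.
    $remaining = array_count_values($b);
    $result = [];
    foreach ($a as $v) {
        // empty() is true for a missing key as well as for a count of 0.
        if (!empty($remaining[$v])) {
            $result[] = $v;
            $remaining[$v]--; // consume one copy from $b
        }
    }
    return $result;
}

print_r(intersectCounted([1, 1, 2, 3, 4, 4, 5], [1, 3, 3, 5, 5])); // values 1, 3, 5
```

Each array is traversed once and every lookup is a hash access, so the work grows roughly linearly with the combined size of the inputs. Whether that beats the other approaches on small arrays would need to be measured.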
mickmackusa

3 Answers

Hi, I have to say that at first glance @Abderrahim's answer looks really nice, but then I tried a simpler approach and tested the performance of both.

Here is the code:

function intersectSimple($a, $b)
{
    $result = array();
    $short = count($a) < count($b) ? $a : $b;
    $long = count($a) < count($b) ? $b : $a;
    foreach ($short as $v) {
        if (in_array($v, $long)) {
            //if found, add to results and remove from $long
            $result[] = $v;
            unset($long[array_search($v, $long)]);
        }
    }
    return $result;
}

function intersectAderrahim($a, $b)
{
    $a_values_count = array_count_values($a);
    $b_values_count = array_count_values($b);

    $res = array_values(array_intersect($a, $b));
    $res_values_count = array_count_values($res);
    foreach ($res as $key => $val)
    {
        if ($res_values_count[$val] > $a_values_count[$val] || $res_values_count[$val] > $b_values_count[$val])
        {
            unset($res[$key]);
            $res_values_count[$val]--;
        }
    }

    return array_values($res);
}

//Start timer
$start = microtime(true);

echo "Start Test\n";
//Each iteration asserts that the result has the expected size
//Run code 100000 times
for ($i = 0; $i < 100000; $i++)
{
    $a = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $b = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $result = intersectSimple($a, $b);
    assert(count($result) == 10);
}

//Stop timer
$end = microtime(true);
$time = $end - $start;
//Print elapsed time in seconds (microtime(true) returns seconds as a float)
echo "Performance Simple: $time\n";

//Start timer
$start = microtime(true);

echo "Start Test\n";
//Each iteration asserts that the result has the expected size
//Run code 100000 times
for ($i = 0; $i < 100000; $i++)
{
    $a = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $b = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $result = intersectAderrahim($a, $b);
    assert(count($result) == 10);
}
//Stop timer
$end = microtime(true);
$time = $end - $start;
//Print elapsed time in seconds (microtime(true) returns seconds as a float)
echo "Performance Aderrahim: $time\n";

So I wrote a quick performance test and ran it; the results (in seconds) are:

Start Test
Performance Simple: 0.060362815856934
Start Test
Performance Aderrahim: 0.16634893417358

I don't know whether this can be extrapolated to a real case, but you can try both in your scenario and test which is better. I would really love to know which is best with real data.

  • Here are my benchmarks : ["Simple",44.815133810043335] ["Abderrahim",50.60647702217102] So the simple version is about 12% faster with quasi-real numbers and arrays. – Abderrahim Benmelouka Jul 26 '22 at 12:01
  • Depending on the type of data you have, you can increase performance by creating a "cache": concatenate the arrays into a string and use it as the key of a dictionary that stores the result. Then the first thing you do is check whether this array combination has already been calculated. Let's see if somebody gives a better answer – Adria Riu Ruiz Jul 26 '22 at 12:15
  • Sorry, I was wrong in the benchmarks. It seems like mine is considerably faster after all. I generated two arrays of 100 integers between 1 and 5 and ran both functions on these two arrays 10000 times, I got these results : ["Abderrahim",0.6441829204559326] ["Simple",1.9437551498413086] – Abderrahim Benmelouka Jul 26 '22 at 12:17
  • This is really interesting and logical: your code is more efficient when the numbers have low variability, in your scenario 1...5, so the array_count_values result is a very small array. Can you check it with values from, say, 1 to 50? – Adria Riu Ruiz Jul 26 '22 at 12:29
  • With values from 1 to 50 : ["Abderrahim",1.0948610305786133] ["Simple",1.574558973312378] I should mention that these results are with PHP 7.2, the difference is less extreme with PHP 8.1 : ["Abderrahim",0.8305587768554688] ["Simple",1.0088911056518555] – Abderrahim Benmelouka Jul 26 '22 at 13:28
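
The benchmark setup described in these comments (two arrays of 100 random integers in a small range, each function run 10000 times) could be sketched roughly as follows. The helpers randomInts and timeRuns are hypothetical names, not code from the commenters, and array_intersect is used only as a stand-in callable so the snippet runs on its own.

```php
<?php
// Hypothetical helper: $n random integers in the range [$min, $max].
function randomInts(int $n, int $min, int $max): array
{
    $out = [];
    for ($i = 0; $i < $n; $i++) {
        $out[] = mt_rand($min, $max);
    }
    return $out;
}

// Hypothetical helper: total seconds taken by $runs calls of $fn($a, $b).
function timeRuns(callable $fn, array $a, array $b, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($a, $b);
    }
    return microtime(true) - $start;
}

mt_srand(42); // fixed seed so repeated runs use the same data
$a = randomInts(100, 1, 5);
$b = randomInts(100, 1, 5);

// Stand-in callable; pass 'intersectSimple' or 'intersectAderrahim' to compare them.
printf("%.4f s\n", timeRuns('array_intersect', $a, $b, 10000));
```

Changing the range passed to randomInts (for example 1 to 50) reproduces the "variability" experiment discussed above; the timings will of course differ by machine and PHP version.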

Here's my attempt; I'm basically looking for a faster way to do this, if possible :

function intersect($a, $b)
{
    $a_values_count = array_count_values($a);
    $b_values_count = array_count_values($b);

    $res = array_values(array_intersect($a, $b));
    $res_values_count = array_count_values($res);
    foreach ($res as $key => $val)
    {
        if ($res_values_count[$val] > $a_values_count[$val] || $res_values_count[$val] > $b_values_count[$val])
        {
            unset($res[$key]);
            $res_values_count[$val]--;
        }
    }

    return array_values($res);
}

assert(intersect([1, 1, 2, 3, 4, 4, 5], [1, 3, 3, 5, 5]) == [1, 3, 5]);
assert(intersect([1, 1, 2, 3, 4, 4, 5], [1, 1, 1, 3, 3, 5, 5]) == [1, 1, 3, 5]);
assert(intersect([1, 1, 2, 3, 4, 4, 5, 5], [1, 3, 3, 5, 5]) == [1, 3, 5, 5]);
assert(intersect([1, 1, 1], [1, 1, 1]) == [1, 1, 1]);
assert(intersect([1, 2, 3], [1, 3, 2]) == [1, 2, 3]);

There were a couple things about @AdriaRiuRuiz's snippet that could be refined.

  1. count() should not be called more than once on the same unchanged array.
  2. array_search() performs the same lookup as in_array(), but returns the key of the first matching value, or false when there is no match. For this reason, in_array() can be omitted; if the value is found, unset it by the returned key.
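
One detail of the second point is worth a short illustration: array_search() can legitimately return the key 0, so the result has to be compared with false using !==. The sketch below also shows the optional third argument, which switches from loose to strict comparison; the variable names are only illustrative.

```php
<?php
$haystack = [1, 2, 3];

$index = array_search(1, $haystack); // int(0): found at the first position
var_dump($index !== false);          // bool(true): the strict check sees the match
var_dump((bool) $index);             // bool(false): a truthiness check would miss it

// Comparison is loose (==) by default; passing true makes it strict (===).
var_dump(array_search('1', $haystack));       // int(0)
var_dump(array_search('1', $haystack, true)); // bool(false)
```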

Code: (Benchmarks)

function intersections(array $a, array $b): array
{
    $result = [];
    if (count($a) < count($b)) {
        $short = $a;
        $long = $b;
    } else {
        $short = $b;
        $long = $a;
    }
    foreach ($short as $v) {
        $index = array_search($v, $long);
        if ($index !== false) {
            $result[] = $v;
            unset($long[$index]);
        }
    }
    return $result;
}

Depending on the input data, it can be faster to omit the count-based ordering of the arrays.

function intersections(array $a, array $b): array
{
    $result = [];
    foreach ($a as $v) {
        $index = array_search($v, $b);
        if ($index !== false) {
            $result[] = $v;
            unset($b[$index]);
        }
    }
    return $result;
}
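
Since the question says the order of the result does not matter, an order-insensitive comparison is a safer way to check these functions against the question's examples. The count-based variant iterates over $b when both arrays have the same length, so for intersect([1, 2, 3], [1, 3, 2]) it yields [1, 3, 2], and == on PHP arrays compares key/value pairs. A minimal sketch, with sameMultiset as a hypothetical helper name:

```php
<?php
// Hypothetical helper: true when both arrays hold the same values
// the same number of times, regardless of order or keys.
function sameMultiset(array $x, array $y): bool
{
    // Arrays are passed by value, so sorting here does not touch the caller's data.
    sort($x);
    sort($y);
    return $x === $y;
}

var_dump([1, 3, 2] == [1, 2, 3]);             // bool(false)
var_dump(sameMultiset([1, 3, 2], [1, 2, 3])); // bool(true)
```

Note that the final === is type-sensitive, so 1 and '1' would count as different values in this sketch.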
mickmackusa