99

Consider the following array:

/www/htdocs/1/sites/lib/abcdedd
/www/htdocs/1/sites/conf/xyz
/www/htdocs/1/sites/conf/abc/def
/www/htdocs/1/sites/htdocs/xyz
/www/htdocs/1/sites/lib2/abcdedd

What is the shortest and most elegant way of detecting the common base path - in this case

/www/htdocs/1/sites/

and removing it from all elements in the array?

lib/abcdedd
conf/xyz
conf/abc/def
htdocs/xyz
lib2/abcdedd
Mechanical snail
Pekka
  • 4
    This might be worth trying: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_substring (I tried it and it works). – Richard Knop Jul 19 '10 at 09:05
  • 1
    Awwww! Such a lot of brilliant input. I will be taking one to solve my problem at hand, but I feel that to really pick a justified accepted answer, I'll have to compare the solutions. It may take a while until I get around to doing that, but I certainly will. – Pekka Jul 19 '10 at 12:39
  • entertaining title :D btw: why cant i find you on the nominated moderators list? @Pekka – The Surrican Jan 19 '11 at 16:38
  • @Pekka You definitely need to do a goodbye post on Meta so we can close as dupes the "where is Pekka???" questions! – Trufa Feb 10 '11 at 23:06
  • @Trufa yeah, I definitely will once I leave! – Pekka Feb 10 '11 at 23:14
  • @Pekka hehe good to hear but we'll miss you anyway! :) – Trufa Feb 10 '11 at 23:16
  • @Trufa yeah! I will certainly be popping in to chat from time to time. :) – Pekka Feb 10 '11 at 23:18
  • 2
    no accepted answer for two years? – Gordon Jun 18 '12 at 19:44
  • @Gordon I'd like to test them all properly to find the best answer - which I'm not getting around to right now... – Pekka Jun 18 '12 at 20:16
  • 1
    @Pekka Getting close to three years since this has no accepted answer :( And it's such an awesome title that I remembered it a moment ago and googled "tetrising an array". – Camilo Martin Apr 19 '13 at 13:26
  • Am I the only one that (awesomeness notwithstanding) thinks "testrising" is not the right word for this? – rinogo Aug 19 '15 at 14:07
  • @rinogo, yeah, you're right, "testrising" would be the wrong word. ;) But mocking aside: I think I know what you really meant, but this is the part in Tetris where you make the pit evenly full and remove it up to that height. In this regard, it's an excellent choice of analogy (and word). _Color_ tetris is a bit off here, though, that's true, so it's strictly for the original game. :) – Sz. Nov 12 '18 at 15:47

16 Answers

35

Write a function longest_common_prefix that takes two strings as input. Then apply it to the strings in any order to reduce them to their common prefix. Since it is associative and commutative the order doesn't matter for the result.

This is the same as for other binary operations like for example addition or greatest common divisor.
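A minimal sketch of that reduction in PHP, assuming the paths are plain byte strings. The helper name `common_prefix_of_all` and the sample data are illustrative assumptions, not part of the answer:

```php
<?php
// Binary operation: longest common prefix of two strings.
function longest_common_prefix(string $a, string $b): string
{
    $end = min(strlen($a), strlen($b));
    $i = 0;
    while ($i < $end && $a[$i] === $b[$i]) {
        $i++;
    }
    return substr($a, 0, $i);
}

// Hypothetical helper: fold the binary operation over the whole list.
function common_prefix_of_all(array $strings): string
{
    if (count($strings) === 0) {
        return '';
    }
    $first = array_shift($strings);
    return array_reduce($strings, 'longest_common_prefix', $first);
}

$paths = array(
    '/www/htdocs/1/sites/lib/abcdedd',
    '/www/htdocs/1/sites/conf/xyz',
    '/www/htdocs/1/sites/lib2/abcdedd',
);
echo common_prefix_of_all($paths), "\n"; // /www/htdocs/1/sites/
```

Note that this is a character-wise prefix: for `/usr/lib` and `/usr/lib2` alone it would return `/usr/lib`, so the result may still need to be cut back to the last `/`.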

starblue
23

Load them into a trie data structure. Starting from the root node, look for the first node that has more than one child. Once you find that magic node, just dismantle the parent node structure above it and make the current node the root.
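A rough sketch of the trie idea in PHP, using nested arrays keyed by path component. It assumes a non-empty list of absolute paths in which no path is itself a directory prefix of another; `build_path_trie` and `common_base_from_trie` are hypothetical names chosen for illustration:

```php
<?php
// Hypothetical helper: build a trie of path components as nested arrays.
function build_path_trie(array $paths): array
{
    $root = array();
    foreach ($paths as $path) {
        $node = &$root;
        foreach (explode('/', trim($path, '/')) as $part) {
            if (!isset($node[$part])) {
                $node[$part] = array();
            }
            $node = &$node[$part];
        }
        unset($node); // release the reference before the next path
    }
    return $root;
}

// Hypothetical helper: walk down while a node has exactly one child
// and that child is not a leaf (a leaf is the last component of a path).
function common_base_from_trie(array $trie): string
{
    $base = '';
    $node = $trie;
    while (count($node) === 1) {
        reset($node);
        $part = key($node);
        $child = $node[$part];
        if (count($child) === 0) {
            break;
        }
        $base .= '/' . $part;
        $node = $child;
    }
    return $base . '/';
}

$trie = build_path_trie(array(
    '/www/htdocs/1/sites/lib/abcdedd',
    '/www/htdocs/1/sites/conf/xyz',
    '/www/htdocs/1/sites/lib2/abcdedd',
));
echo common_base_from_trie($trie), "\n"; // /www/htdocs/1/sites/
```

The first node with more than one child (`sites` here) becomes the new root, and everything above it is the common base path.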

BoltClock
bragboy
  • 10
    Wouldn't the operation that loads the data into the trie tree structure you describe kinda include the algorithm to find the longest common prefix, thus making actually using a tree structure unnecessary? Ie why check the tree for multiple children when you could detect that while building the tree. Why then a tree at all? I mean if you start with an array already. If you can change the storage to just using a trie instead of arrays I guess it makes sense. – Ben Schwehn Jul 18 '10 at 15:21
  • 2
    I think that if you are careful then my solution is more efficient than building a trie. – starblue Jul 24 '10 at 19:50
  • This answer is wrong. There are trivial solutions posted in my and other answers that are O(n). – Ari Ronen Aug 08 '10 at 06:06
  • @el.pescado: Tries are quadratic in size with the length of the source string in the worst case. – Billy ONeal May 08 '12 at 17:06
10
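// Position of the last shared '/'; narrowed with every path that is compared.
$common = PHP_INT_MAX;
foreach ($a as $item) {
        $common = min($common, str_common($a[0], $item, $common));
}

// substr() starts at the last shared slash, so each result keeps its leading '/'.
$result = array();
foreach ($a as $item) {
        $result[] = substr($item, $common);
}
print_r($result);

// Returns the position of the last '/' inside the common prefix of $a and $b,
// looking at no more than $max + 1 characters.
function str_common($a, $b, $max)
{
        $pos = 0;
        $last_slash = 0;
        $len = min(strlen($a), strlen($b), $max + 1);
        while ($pos < $len) {
                if ($a[$pos] != $b[$pos]) return $last_slash;
                if ($a[$pos] == '/') $last_slash = $pos;
                $pos++;
        }
        return $last_slash;
}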
user229044
Sjoerd
  • This is by far the best solution posted, but it needed improvement. It didn't take the previous longest common path into account (possibly iterating over more of the string than necessary), and didn't take paths into account (so for `/usr/lib` and `/usr/lib2` it gave `/usr/lib` as the longest common path, rather than `/usr/`). I (hopefully) fixed both. – Gabe Jul 18 '10 at 15:41
8

You can use XOR in this situation to find the common parts of the strings. Any time you XOR two bytes that are the same, you get a null byte as the output. So we can use that to our advantage:

$first = $array[0];
$length = strlen($first);
$count = count($array);
for ($i = 1; $i < $count; $i++) {
    $length = min($length, strspn($array[$i] ^ $first, chr(0)));
}

After that single loop, the $length variable will be equal to the longest common basepart between the array of strings. Then, we can extract the common part from the first element:

$common = substr($array[0], 0, $length);

And there you have it. As a function:

function commonPrefix(array $strings) {
    $first = $strings[0];
    $length = strlen($first);
    $count = count($strings);
    for ($i = 1; $i < $count; $i++) {
        $length = min($length, strspn($strings[$i] ^ $first, chr(0)));
    }
    return substr($first, 0, $length);
}

Note that it does use more than one iteration, but those iterations are done in libraries, so in interpreted languages this will have a huge efficiency gain...

Now, if you want only full paths, we need to truncate to the last / character. So:

$prefix = preg_replace('#/[^/]*$#', '', commonPrefix($paths));

Now, this may cut too much: two strings such as /foo/bar and /foo/bar/baz will be cut to /foo. But short of adding another iteration round to determine whether the next character is either / or end-of-string, I can't see a way around that...
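That extra round could look roughly like the sketch below. `commonDirPrefix` is a hypothetical helper, not part of the answer; it assumes all elements are strings and only cuts back to the last / when some path continues past the prefix with a character other than /:

```php
<?php
// Hypothetical helper: XOR-based common prefix with a directory boundary check.
function commonDirPrefix(array $paths): string
{
    if (count($paths) === 0) {
        return '';
    }
    $first = reset($paths);
    $length = strlen($first);
    foreach ($paths as $path) {
        // XOR yields "\0" wherever the bytes agree; strspn counts the leading run.
        $length = min($length, strspn($path ^ $first, "\0"));
    }
    $prefix = substr($first, 0, $length);

    // Second round: the prefix is a whole path only if every string
    // either ends right there or continues with a '/'.
    foreach ($paths as $path) {
        if (strlen($path) > $length && $path[$length] !== '/') {
            $pos = strrpos($prefix, '/');
            return $pos === false ? '' : substr($prefix, 0, $pos);
        }
    }
    return $prefix;
}

var_dump(commonDirPrefix(array('/foo/bar', '/foo/bar/baz'))); // "/foo/bar"
var_dump(commonDirPrefix(array('/usr/lib', '/usr/lib2')));    // "/usr"
```

The cost is one more pass over the array, but each check only looks at a single character per string.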

Fanis Hatzidakis
ircmaxell
3

A naive approach would be to explode the paths at the / and successively compare every element in the arrays. For example, the first element would be empty in all arrays, so it will be removed; the next element will be www, which is the same in all arrays, so it gets removed, and so on.

Something like (untested)

$exploded_paths = array();

foreach($paths as $path) {
    $exploded_paths[] = explode('/', $path);
}

$equal = true;
$ref = &$exploded_paths[0]; // compare against the first path for simplicity

// stop as soon as the reference path runs out of elements
while($equal && isset($ref[0])) {
    foreach($exploded_paths as $path_parts) {
        if(!isset($path_parts[0]) || $path_parts[0] !== $ref[0]) {
            $equal = false;
            break;
        }
    }
    if($equal) {
        foreach($exploded_paths as &$path_parts) {
            array_shift($path_parts); // remove the first element
        }
        unset($path_parts); // drop the reference so the next pass cannot overwrite the last path
    }
}

Afterwards you just have to implode the elements in $exploded_paths again:

function impl($arr) {
    return '/' . implode('/', $arr);
}
$paths = array_map('impl', $exploded_paths);

Which gives:

Array
(
    [0] => /lib/abcdedd
    [1] => /conf/xyz
    [2] => /conf/abc/def
    [3] => /htdocs/xyz
    [4] => /lib2/abcdedd
)

This might not scale well ;)

Felix Kling
3

Ok, I'm not sure this is bullet-proof, but I think it works:

echo array_reduce($array, function($reducedValue, $arrayValue) {
    if($reducedValue === NULL) return $arrayValue;
    for($i = 0; $i < strlen($reducedValue); $i++) {
        if(!isset($arrayValue[$i]) || $arrayValue[$i] !== $reducedValue[$i]) {
            return substr($reducedValue, 0, $i);
        }
    }
    return $reducedValue;
});

This will take the first value in the array as the reference string. Then it will iterate over the reference string and compare each char with the char of the second string at the same position. If a char doesn't match, the reference string will be shortened to the position of that char and the next string is compared. The function will then return the shortest matching string.

Performance depends on the strings given. The earlier the reference string gets shorter, the quicker the code will finish. I really have no clue how to put that in a formula though.

I found that Artefacto's approach to sort the strings increases performance. Adding

asort($array);
$array = array(array_shift($array), array_pop($array));

before the array_reduce will significantly increase performance.

Also note that this will return the longest matching initial substring, which is more versatile but won't give you the common path. You have to run

substr($result, 0, strrpos($result, '/'));

on the result. And then you can use the result to remove the values

print_r(array_map(function($v) use ($path){
    return str_replace($path, '', $v);
}, $array));

which should give:

[0] => /lib/abcdedd
[1] => /conf/xyz
[2] => /conf/abc/def
[3] => /htdocs/xyz
[4] => /lib2/abcdedd

Feedback welcome.

Gordon
3

You could remove the prefix the fastest way, reading each character only once:

function findLongestWord($lines, $delim = "/")
{
    $max = 0;
    $len = strlen($lines[0]);
    $count = count($lines);

    // read first string once
    for($i = 0; $i < $len; $i++) {
        for($n = 1; $n < $count; $n++) {
            // a shorter line that ends here counts as a difference
            if(!isset($lines[$n][$i]) || $lines[0][$i] != $lines[$n][$i]) {
                // we've found a difference between current token
                // stop search:
                return $max;
            }
        }
        if($lines[0][$i] == $delim) {
            // we've found a complete token:
            $max = $i + 1;
        }
    }
    return $max;
}

$max = findLongestWord($lines);
// cut prefix of len "max"
for($n = 0; $n < count($lines); $n++) {
    $lines[$n] = substr($lines[$n], $max);
}
Doomsday
  • Indeed, a character-based comparison will be the fastest. All the other solutions use "expensive" operators that in the end also will do (multiple) character comparisons. [It was even mentioned in the scriptures of the Holy Joel](http://www.joelonsoftware.com/articles/fog0000000319.html)! – Jan Fabry Aug 09 '10 at 07:03
2
$values = array('/www/htdocs/1/sites/lib/abcdedd',
                '/www/htdocs/1/sites/conf/xyz',
                '/www/htdocs/1/sites/conf/abc/def',
                '/www/htdocs/1/sites/htdocs/xyz',
                '/www/htdocs/1/sites/lib2/abcdedd'
);


function splitArrayValues($r) {
    return explode('/',$r);
}

function stripCommon($values) {
    $testValues = array_map('splitArrayValues',$values);

    $i = 0;
    foreach($testValues[0] as $key => $value) {
        foreach($testValues as $arraySetValues) {
            if (!isset($arraySetValues[$key]) || $arraySetValues[$key] != $value) break 2;
        }
        $i++;
    }

    $returnArray = array();
    foreach($testValues as $value) {
        $returnArray[] = implode('/',array_slice($value,$i));
    }

    return $returnArray;
}


$newValues = stripCommon($values);

echo '<pre>';
var_dump($newValues);
echo '</pre>';

EDIT Variant of my original method using an array_walk to rebuild the array

$values = array('/www/htdocs/1/sites/lib/abcdedd',
                '/www/htdocs/1/sites/conf/xyz',
                '/www/htdocs/1/sites/conf/abc/def',
                '/www/htdocs/1/sites/htdocs/xyz',
                '/www/htdocs/1/sites/lib2/abcdedd'
);


function splitArrayValues($r) {
    return explode('/',$r);
}

function rejoinArrayValues(&$r,$d,$i) {
    $r = implode('/',array_slice($r,$i));
}

function stripCommon($values) {
    $testValues = array_map('splitArrayValues',$values);

    $i = 0;
    foreach($testValues[0] as $key => $value) {
        foreach($testValues as $arraySetValues) {
            if (!isset($arraySetValues[$key]) || $arraySetValues[$key] != $value) break 2;
        }
        $i++;
    }

    array_walk($testValues, 'rejoinArrayValues', $i);

    return $testValues;
}


$newValues = stripCommon($values);

echo '<pre>';
var_dump($newValues);
echo '</pre>';

EDIT

The most efficient and elegant answer is likely to involve taking functions and methods from each of the provided answers

Mark Baker
2

This has the disadvantage of not having linear time complexity; however, in most cases the sort will definitely not be the operation that takes the most time.

Basically, the clever part (at least I couldn't find a fault with it) here is that after sorting you will only have to compare the first path with the last.

sort($a);
$a = array_map(function ($el) { return explode("/", $el); }, $a);
$first = reset($a);
$last = end($a);
for ($eqdepth = 0; isset($first[$eqdepth], $last[$eqdepth]) && $first[$eqdepth] === $last[$eqdepth]; $eqdepth++) {}
array_walk($a,
    function (&$el) use ($eqdepth) {
        for ($i = 0; $i < $eqdepth; $i++) {
            array_shift($el);
        }
     });
$res = array_map(function ($el) { return implode("/", $el); }, $a);
Artefacto
1

I would explode the values based on the / and then use array_intersect_assoc to detect the common elements and ensure they have the correct corresponding index in the array. The resulting array could be recombined to produce the common path.

function getCommonPath($pathArray)
{
    $pathElements = array();

    foreach($pathArray as $path)
    {
        $pathElements[] = explode("/",$path);
    }

    $commonPath = $pathElements[0];

    for($i=1;$i<count($pathElements);$i++)
    {
        $commonPath = array_intersect_assoc($commonPath,$pathElements[$i]);
    }

    if(is_array($commonPath)) return implode("/",$commonPath);
    else return null;
}

function removeCommonPath($pathArray)
{
    $commonPath = getCommonPath($pathArray);

    for($i=0;$i<count($pathArray);$i++)
    {
        $pathArray[$i] = substr($pathArray[$i],strlen($commonPath));
    }

    return $pathArray;
}

This is untested, but, the idea is that the $commonPath array only ever contains the elements of the path that have been contained in all path arrays that have been compared against it. When the loop is complete, we simply recombine it with / to get the true $commonPath

Update As pointed out by Felix Kling, array_intersect won't consider paths that have common elements but in different orders... To solve this, I used array_intersect_assoc instead of array_intersect

Update Added code to remove the common path (or tetris it!) from the array as well.

Brendan Bullen
1

The problem can be simplified if just viewed from the string comparison angle. This is probably faster than array-splitting:

$longest = $tetris[0];  # or array_pop()
foreach ($tetris as $cmp) {
        while (strncmp($longest . "/", $cmp, strlen($longest) + 1) !== 0) {
                $longest = substr($longest, 0, strrpos($longest, "/"));
        }
}
mario
  • That won't work e.g. with this set array('/www/htdocs/1/sites/conf/abc/def', '/www/htdocs/1/sites/htdocs/xyz', '/www/htdocs/1/sitesjj/lib2/abcdedd',). – Artefacto Jul 18 '10 at 13:27
  • @Artefacto: You were right. So I've simply modified it to always include a trailing slash "/" in the comparison. Makes it non-ambiguous. – mario Jul 18 '10 at 14:10
1

Perhaps porting the algorithm Python's os.path.commonprefix(m) uses would work?

def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    s1 = min(m)
    s2 = max(m)
    n = min(len(s1), len(s2))
    for i in xrange(n):
        if s1[i] != s2[i]:
            return s1[:i]
    return s1[:n]

That is, uh... something like

function commonprefix($m) {
  if(!$m) return "";
  $s1 = min($m);
  $s2 = max($m);
  $n = min(strlen($s1), strlen($s2));
  for($i=0;$i<$n;$i++) if($s1[$i] != $s2[$i]) return substr($s1, 0, $i);
  return substr($s1, 0, $n);
}

After that you can just substr each element of the original list with the length of the common prefix as the start offset.
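As a small sketch of that last step, assuming the prefix has already been computed (for instance by `commonprefix()` above): `strip_common_prefix` is a hypothetical helper name, and the sample paths are taken from the question.

```php
<?php
// Hypothetical helper: cut a known common prefix off every element.
function strip_common_prefix(array $paths, string $prefix): array
{
    $offset = strlen($prefix);
    return array_map(function ($path) use ($offset) {
        // Everything from the end of the prefix onwards is kept.
        return substr($path, $offset);
    }, $paths);
}

$paths = array(
    '/www/htdocs/1/sites/lib/abcdedd',
    '/www/htdocs/1/sites/conf/xyz',
);
print_r(strip_common_prefix($paths, '/www/htdocs/1/sites/'));
```

Because the prefix is common to every element, only its length matters; the function never compares the strings again.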

AKX
1

I'll throw my hat in the ring …

function longestCommonPrefix($a, $b) {
    $i = 0;
    $end = min(strlen($a), strlen($b));
    while ($i < $end && $a[$i] == $b[$i]) $i++;
    return substr($a, 0, $i);
}

function longestCommonPrefixFromArray(array $strings) {
    $count = count($strings);
    if (!$count) return '';
    $prefix = reset($strings);
    for ($i = 1; $i < $count; $i++)
        $prefix = longestCommonPrefix($prefix, $strings[$i]);
    return $prefix;
}

function stripPrefix(&$string, $foo, $length) {
    $string = substr($string, $length);
}

Usage:

$paths = array(
    '/www/htdocs/1/sites/lib/abcdedd',
    '/www/htdocs/1/sites/conf/xyz',
    '/www/htdocs/1/sites/conf/abc/def',
    '/www/htdocs/1/sites/htdocs/xyz',
    '/www/htdocs/1/sites/lib2/abcdedd',
);

$longComPref = longestCommonPrefixFromArray($paths);
array_walk($paths, 'stripPrefix', strlen($longComPref));
print_r($paths);
rik
1

Well, there are already some solutions here but, just because it was fun:

$values = array(
    '/www/htdocs/1/sites/lib/abcdedd',
    '/www/htdocs/1/sites/conf/xyz',
    '/www/htdocs/1/sites/conf/abc/def', 
    '/www/htdocs/1/sites/htdocs/xyz',
    '/www/htdocs/1/sites/lib2/abcdedd' 
);

function findCommon($values){
    $common = false;
    foreach($values as &$p){
        $p = explode('/', $p);
        if(!$common){
            $common = $p;
        } else {
            $common = array_intersect_assoc($common, $p);
        }
    }
    return $common;
}
function removeCommon($values, $common){
    foreach($values as &$p){
        $p = explode('/', $p);
        $p = array_diff_assoc($p, $common);
        $p = implode('/', $p);
    }

    return $values;
}

echo '<pre>';
print_r(removeCommon($values, findCommon($values)));
echo '</pre>';

Output:

Array
(
    [0] => lib/abcdedd
    [1] => conf/xyz
    [2] => conf/abc/def
    [3] => htdocs/xyz
    [4] => lib2/abcdedd
)
acm
0
$arrMain = array(
            '/www/htdocs/1/sites/lib/abcdedd',
            '/www/htdocs/1/sites/conf/xyz',
            '/www/htdocs/1/sites/conf/abc/def',
            '/www/htdocs/1/sites/htdocs/xyz',
            '/www/htdocs/1/sites/lib2/abcdedd'
);
function explodePath( $strPath ){ 
    return explode("/", $strPath);
}

function removePath( $strPath)
{
    global $strCommon;
    return str_replace( $strCommon, '', $strPath );
}
$arrExplodedPaths = array_map( 'explodePath', $arrMain ) ;

//Check for common and skip first 1
$strCommon = '';
for( $i=1; $i< count( $arrExplodedPaths[0] ); $i++)
{
    for( $j = 0; $j < count( $arrExplodedPaths); $j++ )
    {
        if( $arrExplodedPaths[0][ $i ] !== $arrExplodedPaths[ $j ][ $i ] )
        {
            break 2;
        } 
    }
    $strCommon .= '/'.$arrExplodedPaths[0][$i];
}
print_r( array_map( 'removePath', $arrMain ) );

This works fine. It is similar to Mark Baker's answer but uses str_replace.

KoolKabin
0

Probably too naive and noobish but it works. I have used this algorithm:

<?php

function strlcs($str1, $str2){
    $str1Len = strlen($str1);
    $str2Len = strlen($str2);
    $ret = array();

    if($str1Len == 0 || $str2Len == 0)
        return $ret; //no similarities

    $CSL = array(); //Common Sequence Length array
    $intLargestSize = 0;

    //initialize the CSL array to assume there are no similarities
    for($i=0; $i<$str1Len; $i++){
        $CSL[$i] = array();
        for($j=0; $j<$str2Len; $j++){
            $CSL[$i][$j] = 0;
        }
    }

    for($i=0; $i<$str1Len; $i++){
        for($j=0; $j<$str2Len; $j++){
            //check every combination of characters
            if( $str1[$i] == $str2[$j] ){
                //these are the same in both strings
                if($i == 0 || $j == 0)
                    //it's the first character, so it's clearly only 1 character long
                    $CSL[$i][$j] = 1; 
                else
                    //it's one character longer than the string from the previous character
                    $CSL[$i][$j] = $CSL[$i-1][$j-1] + 1; 

                if( $CSL[$i][$j] > $intLargestSize ){
                    //remember this as the largest
                    $intLargestSize = $CSL[$i][$j]; 
                    //wipe any previous results
                    $ret = array();
                    //and then fall through to remember this new value
                }
                if( $CSL[$i][$j] == $intLargestSize )
                    //remember the largest string(s)
                    $ret[] = substr($str1, $i-$intLargestSize+1, $intLargestSize);
            }
            //else, $CSL should be set to 0, which it was already initialized to
        }
    }
    //return the list of matches
    return $ret;
}


$arr = array(
'/www/htdocs/1/sites/lib/abcdedd',
'/www/htdocs/1/sites/conf/xyz',
'/www/htdocs/1/sites/conf/abc/def',
'/www/htdocs/1/sites/htdocs/xyz',
'/www/htdocs/1/sites/lib2/abcdedd'
);

// find the common substring
$longestCommonSubstring = strlcs( $arr[0], $arr[1] );

// remove the common substring
foreach ($arr as $k => $v) {
    $arr[$k] = str_replace($longestCommonSubstring[0], '', $v);
}
var_dump($arr);

Output:

array(5) {
  [0]=>
  string(11) "lib/abcdedd"
  [1]=>
  string(8) "conf/xyz"
  [2]=>
  string(12) "conf/abc/def"
  [3]=>
  string(10) "htdocs/xyz"
  [4]=>
  string(12) "lib2/abcdedd"
}

:)

Richard Knop
  • @Doomsday There is a link to wikipedia in my answer... try to read it first before commenting. – Richard Knop Jul 22 '10 at 16:23
  • I think in the end you only compare the first two paths. In your example this works, but if you remove the first path, it will find `/www/htdocs/1/sites/conf/` as a common match. Also, the algorithm searches for substrings starting anywhere in the string, but for this question you know you can start at location 0, which makes it much simpler. – Jan Fabry Aug 09 '10 at 06:58