
I have a file with the following format:

Y1DP480P T FDVII005 ID=000
Y1DPMS7M T Y1DP480P ID=000
Y1DPMS7M T Y1DP4860 ID=000
Y1DPMS7M T Y1ENDCYP ID=000
Y1DPMS6M T Y1DPMS7M ID=000
Y1DPMS5M T VPY1CM28 ID=000
Y1DPMS5M T Y1DPMS6M ID=000
Y1DPAS21 T Y1DPMS5M ID=000
Y1DPMS4M T FDRBC004 ID=000
Y1DPMS4M T FDYBL004 ID=000

etc. etc.

Only the data in columns 1-8 and 12-19 is used, and it can be thought of as:

node1 -> node2
node1 -> node3
node3 -> node5
node2 -> node4
node4 -> node5
node5 -> node7

I need an efficient way to map the path from a given start node to a given end node.

For example, if I want the path from node1 to node7, the function would return node1->node3, node3->node5, node5->node7.

Current approach:

I read the file into an array, taking the first 19 characters as both the key and the value, e.g.

$data['Y1DP480P T FDVII005'] = 'Y1DP480P T FDVII005'

(I use the value as the key because the input file may contain duplicates, and this filters them out; I don't think PHP has a 'set' data structure.)
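A minimal sketch of that deduplication step, mimicking a set by using the 19-character record as its own array key ($input here is a hypothetical stand-in for the file contents):

```php
<?php
// $input stands in for the raw file contents; note the duplicated first line.
$input = "Y1DP480P T FDVII005 ID=000\n" .
         "Y1DPMS7M T Y1DP480P ID=000\n" .
         "Y1DP480P T FDVII005 ID=000";

$data = [];
foreach (explode("\n", $input) as $line) {
    $record = substr($line, 0, 19);
    $data[$record] = $record; // a later duplicate just overwrites the same key
}
// $data now holds 2 entries; the duplicate collapsed into one
```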

I have a recursive subroutine that finds the next 'n' dependants from a given node as follows:

(on entry, $path[] is an empty array, node data is in $data, the node to start the search from is $job and the depth of dependants is $depth)

function createPathFrom($data, $job, $depth) {
    global $path, $maxDepth, $timeStart;
    $job = trim($job);
    if ( $depth > $maxDepth ) {return;} // Search depth exceeded
    // if ( (microtime(true) - $timeStart) > 70 ) {return;} // Might not be needed as we have the same further down

    // Get the list of lines whose first job matches $job.
    $dependents = array_filter($data, function($dataLine) use ($job) {
        $dependent = explode(" ", $dataLine)[0];
        return ( $dependent == $job );
    });

    if (count($dependents) == 0) {
        return;
    }

    $path = array_merge($path, $dependents);
    foreach ($dependents as $dependency) {
        $dependent = explode(" ", $dependency)[2];
        if ( (microtime(true) - $timeStart) > 85 ) {return;} // Let's get out if running out of time... (90s in HTTPD/conf)
        createPathFrom($data, $dependent, $depth + 1);
    }
}

I have an almost identical function, createPathTo, that establishes the predecessors of my end node.

The time limits (70s & 85s - and yes, one is definitely redundant) and the depth limit are there to stop my CGI script timing out.

If I call both routines with enough 'depth', I can see if they connect, but there are a lot of dead-ends.

I think I'm doing a breadth-first search whereas I think I should be doing a depth-first search and throwing away the searches that don't reach my target node.

Question:

Given a start node and an end node, is there an efficient search algorithm that will return the bare minimum of nodes to make the connection, or some value indicating that no path was found?

This question follows on from Recursive function in PHP to find path between arbitrary nodes. I have the nodes leading to (and now from) my target node but now I want to trim it to just the path between 2 nodes.

Edit: I'm sure the answer is already here on SO, but I'm pretty new to PHP and these sorts of algorithms, so haven't been able to find one.

Steve Ives

1 Answer


You'll be better off with a structure like this:

$data =[
    "Y1DP480P" => ["FDVII005" => true],
    "Y1DPMS7M" => ["Y1DP480P" => true, "Y1DP4860" => true, "Y1ENDCYP" => true],
    // ...etc
];

So, per key you have a "set" of child keys that can be reached in one step from that key. Although PHP has no set data structure, this is how you typically mimic one: use an associative array with true values (or whatever you prefer instead). This will also ignore any duplicate entries you might have in the input.

Then, a standard BFS will be quite efficient:

$input = "aaaaaaaa T bbbbbbbb ID=000
aaaaaaaa T cccccccc ID=000
cccccccc T eeeeeeee ID=000
bbbbbbbb T dddddddd ID=000
dddddddd T eeeeeeee ID=000
eeeeeeee T gggggggg ID=000";

// Convert input to the data structure:
$data = [];
foreach (explode("\n", $input) as $line) {
    list($a, $b) = explode(" T ", substr($line, 0, 19));
    $data[$a][$b] = true;
    if (!isset($data[$b])) $data[$b] = [];
}

function shortestPath($data, $source, $target) { // Perform a BFS
    $comeFrom[$source] = null;
    $frontier = [$source];
    while (count($frontier)) {
        $nextFrontier = [];
        foreach ($frontier as $key) {
            if ($key == $target) {
                $path = [];
                while ($key) { // unwind the comeFrom info into a path
                    $path[] = $key;
                    $key = $comeFrom[$key];
                }
                return array_reverse($path); // the path needs to go from source to target
            }
            foreach ($data[$key] as $neighbor => $_) {
                if (array_key_exists($neighbor, $comeFrom)) continue; // not isset(): $comeFrom[$source] is null and must still count as visited
                $comeFrom[$neighbor] = $key;
                $nextFrontier[] = $neighbor;
            }
        }
        $frontier = $nextFrontier;
    }
    return null; // no path exists between $source and $target
}

$path = shortestPath($data, "aaaaaaaa", "gggggggg");

print_r($path); // ["aaaaaaaa", "cccccccc", "eeeeeeee", "gggggggg"]
trincot
  • Thanks - I’ll see if I get time tomorrow to try this. – Steve Ives Jan 22 '20 at 19:45
  • My input file has just over 1m lines (1,040,500) and is 84MB in size, but there are only about 100,000 unique nodes, as each node can have many dependents/predecessors, each of which is on a separate line of my input file. So my current data array has one 19-byte element per line of the input file. I've just tried your routine to build the new array (this routine should be called once per record, with $data defined first?) and I run out of storage. BUT - I already know the start and end nodes I want, so I can modify your parser to simply reject anything else, which will keep it small, yes? – Steve Ives Jan 23 '20 at 09:52
  • Sure. Did you run into memory problems? – trincot Jan 23 '20 at 09:54
  • Sorry - accidentally posted 1/2 a comment, but yes I did :-) Please see edited version. However, since I know my start node beforehand, I'll just update your function to simply reject the line if $a <> $startNode – Steve Ives Jan 23 '20 at 10:03
  • Actually - I tried ignoring anything where $a <> $startnode, but I actually need to work backwards from $endnode (find endnode's predecessors and work backwards until I find a path that includes startnode). I'll see if I can get this working :-) Were it not for my storage problem, then I could parse the whole input file and then filter for any entry containing $endnode, I think. – Steve Ives Jan 23 '20 at 10:28
  • I'm not sure what the problem is? Can you be more explicit how my code fails for your input? I don't know what you mean with `if $a <> $startNode`, since I don't have such a variable. If the problem is to work backwards, why don't you reverse the building of the data structure (swap `$a` and `$b`) and swap the `$source` and `$target`? – trincot Jan 23 '20 at 11:06
  • Your code is only 'failing' because I'm running out of memory, so I'm wondering if I can filter it somehow whilst building the data array. For example, the original array I created has 194,000 elements for a 211,000-line input file (this was a cut-down input), so about 17,000 duplicates. I used that input array to build the new array from your code, and I ran out of storage processing the 139,000th element, at which time the new array had 146,000 elements. I can save storage by not building the original array, but I don't know how to allocate more storage to my PHP environment. – Steve Ives Jan 23 '20 at 12:17
  • Can't you build the final data structure directly while reading the input, and so save the storage that you currently occupy for reading the file content into that input array? – trincot Jan 23 '20 at 12:29
  • 1
    Thanks very much - this is working now. The only problem I might have are to do with the amount of data but your code identifies the path correctly. – Steve Ives Jan 23 '20 at 13:13
  • Quick question - Some of the nodes have in excess of 200 parents/children and I often run out of memory during the 'shortestPath' function. (My input file is really pushing the boundary on storage as it is). Should I look into writing the 'shortestPath' array as a disk file? When finished I can delete the input array and read and draw the shortest path file? – Steve Ives Feb 07 '20 at 11:04
  • Alternatively, would having the list of dependents as a simple string rather than an array save much memory? – Steve Ives Feb 07 '20 at 11:15
  • For memory efficiency, you could first translate the node identifiers from strings to unique integers, then do the whole search based on those integers only, and translate the final path of integers back to their names. Integers occupy less space than strings of 8 characters. – trincot Feb 07 '20 at 12:29
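The "build the structure directly while reading the input" suggestion from the comments might look like this: stream the file with fgets so only the adjacency structure, never the raw 84MB of text, is held in memory (a sketch; the filename and helper name are made up, and the fixed-width "a T b ID=000" format is assumed):

```php
<?php
// Build the adjacency structure while streaming the file line by line,
// so the raw file contents are never held in memory all at once.
function buildGraph(string $filename): array {
    $data = [];
    $fh = fopen($filename, 'r');
    while (($line = fgets($fh)) !== false) {
        if (strlen($line) < 19) continue;          // skip malformed/blank lines
        list($a, $b) = explode(" T ", substr($line, 0, 19));
        $data[$a][$b] = true;                      // duplicate lines collapse here
        if (!isset($data[$b])) $data[$b] = [];     // ensure leaf nodes exist as keys
    }
    fclose($fh);
    return $data;
}
```

The same idea also gives the reversed graph trincot mentions: swap $a and $b in the assignment to search backwards from the end node.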