6

Update 2011-12-28: Here's a blog post with a less vague description of the problem I was trying to solve, my work on it, and my current solution: Watching Every MLB Team Play A Game


I'm trying to solve a kind of strange pathfinding challenge. I have a directed acyclic graph, and every edge has a distance value. I want to find a shortest path. Simple, right? Well, there are a couple of reasons I can't just use Dijkstra's algorithm or A*.

  1. I don't care at all what the starting node of my path is, nor the ending node. I just need a path that includes exactly 10 nodes. But:
  2. Each node has an attribute, let's say its color. Each node has one of 20 different possible colors.
  3. The path I'm trying to find is the shortest path with exactly 10 nodes, where each node is a different color. I don't want any of the nodes in my path to have the same color as any other node.
  4. It'd be nice to be able to force my path to have one value for one of the attributes ("at least one node must be blue", for instance), but that's not really necessary.

This is a simplified example. My full data set actually has three different attributes for each node that must all be unique, and I have 2k+ nodes each with an average of 35 outgoing edges. Since getting a perfect "shortest path" may be exponential or factorial time, an exhaustive search is really not an option. What I'm really looking for is some approximation of a "good path" that meets the criterion under #3.
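To make the constraints above concrete, here is a minimal sketch of a validity check for a candidate path. The representation is an assumption, not something from my real data: `edges` is a hypothetical dict mapping each node to a dict of successor -> distance, and `color` is a hypothetical dict mapping each node to its color.

```python
def path_cost(path, edges, color, want=None):
    """Return the total distance of `path`, or None if the path is invalid.

    edges: dict node -> {successor: distance}   (assumed representation)
    color: dict node -> color                   (assumed representation)
    want:  required number of nodes, or None to accept any length
    """
    if want is not None and len(path) != want:
        return None
    # Every node on the path must have a different color.
    if len({color[v] for v in path}) != len(path):
        return None
    total = 0
    for a, b in zip(path, path[1:]):
        # Consecutive nodes must be joined by an edge a -> b.
        if b not in edges.get(a, {}):
            return None
        total += edges[a][b]
    return total
```

With three attributes per node, the same check would be repeated once per attribute.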

Can anyone point me towards an algorithm that I might be able to use (even modified)?


Some stats on my full data set:

  • Total nodes: 2430
  • Total edges: 86524
  • Nodes with no incoming edges: 19
  • Nodes with no outgoing edges: 32
  • Most outgoing edges: 42
  • Average edges per node: 35.6 (in each direction)
  • Due to the nature of the data, I know that the graph is acyclic
  • And in the full data set, I'm looking for a path of length 15, not 10
Plutor
  • Shortest path with exactly 10 nodes? I'm a bit confused, can you clarify this part? – biziclop Dec 09 '11 at 20:48
  • Your task has nothing to do with "shortness" (length), why did you mention shortest path several times? – Karoly Horvath Dec 09 '11 at 20:49
  • Sorry, I didn't mention that all of the edges also have values, as this seemed like a normal part of finding shortest paths. I'll add that now. – Plutor Dec 09 '11 at 20:53
  • Using a depth first approach, the time complexity is O(b^d), where d, the depth, is 10, and b is the breadth. I can't deduce the breadth from what you've written here. The depth is 10 from the number of nodes you've specified. – Richard Povinelli Dec 12 '11 at 04:16
  • In my full data set, the depth is 15 (I simplified the question) and (even being relatively intelligent about pruning branches), the breadth averages 35. And since I'm looking for *some* path, not necessarily starting at a specific node, an exhaustive search would involve searching from every one of the ~2000 nodes. So about 2000*35^15 possible paths. – Plutor Dec 12 '11 at 13:25
  • So exhaustive search is pretty much out of the question :-) – Richard Povinelli Dec 12 '11 at 17:24
  • Could you give an estimate on the number of *incoming* edges per node? Could you give an estimate on the number of *roots* and the number of *terminal* nodes? (I am about to create a script to generate some fake data for this problem) – wildplasser Dec 14 '11 at 14:23
  • @wildplasser: Added some stats to the post. – Plutor Dec 14 '11 at 15:26
  • Thanks. Hmm... 19 roots and only 32 terminals. Typical path length should be about 25? ... I'll post a script here to generate the fake data. BRB. BTW: what is this? Protein folding? – wildplasser Dec 14 '11 at 15:38
  • No, it's something far more prosaic. Let's just say I'm trying to schedule a road trip. The nodes are events and the colors are actually specific locations (and I want to go to 15 events in 15 different cities). – Plutor Dec 14 '11 at 16:17
  • Ah, traveling salesman with limited long-term memory ... – wildplasser Dec 14 '11 at 17:29
  • *This can't be an intractable problem* Sure it can—given what you've told us, there's an easy objective-preserving reduction from *non-metric* TSP, which is completely inapproximable. It might help *a lot* if you could be more specific. – Per Dec 15 '11 at 01:45
  • But, if it is a TSP kind of problem, why is it a DAG? Makes no sense to me. (well: it could be a downstream boat-trip ...) – wildplasser Dec 18 '11 at 13:48
  • It's a DAG because every event occurs on one specific date. If you attend any given event, you can only then attend one that comes _after_ it chronologically. So it sort of is a downstream boat trip. – Plutor Dec 19 '11 at 10:23

6 Answers

1

If the number of possible values is low, you can use the Floyd algorithm with a slight modification: for each path you store a bitmap that represents the different values already visited. (In your case the bitmap will be 20 bits wide per path.)

Then when you perform the length comparison, you also AND the bitmaps of the two paths you are about to join. If the result is zero, the paths share no value and the joined path is valid; in that case you OR the bitmaps together and store the result as the new bitmap for the path.
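As a minimal sketch of the bitmap test, assuming colors are numbered 0 to 19 and that the two bitmaps describe disjoint sets of nodes (a midpoint shared by both halves would have to be counted in only one of them); the function names are made up for illustration:

```python
def mask_of(color_indices):
    """Build a bitmap with one bit set per color index (0-based)."""
    mask = 0
    for c in color_indices:
        mask |= 1 << c
    return mask


def try_join(mask_a, mask_b):
    """Return the combined bitmap if the two paths share no color, else None."""
    if mask_a & mask_b:      # a common bit means a repeated color
        return None
    return mask_a | mask_b   # union of the colors used by both paths
```

Both operations are single machine-word instructions for 20 colors, so the check adds almost nothing to the cost of each comparison.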

biziclop
  • Yeah, there are few enough values that a bitmap makes sense. Does the Floyd algorithm just give you the length of the shortest path between nodes i and j, or the actual path itself? I can't see how you get a list of all of the nodes in the path out of it (which I care about). – Plutor Dec 09 '11 at 23:00
  • @Plutor The basic Floyd only gives you the length but on the page I linked to they describe how to extend it to give you all the actual paths too. – biziclop Dec 09 '11 at 23:09
  • The problem with Floyd is that it'll give me the shortest path between two nodes. It doesn't guarantee that path will be exactly n edges long. It's possible that my (giant, complex) data set has zero "shortest paths" between two nodes that have enough steps. – Plutor Dec 13 '11 at 15:03
  • @Plutor That's true. It's not a very good solution then. At least the bitmap idea is usable with most algorithms. – biziclop Dec 13 '11 at 16:28
1

This is a case where the question itself contains most of the answer.

Do a breadth-first search starting from all root nodes. When the number of paths being searched in parallel exceeds some limit, drop the longest ones. Path length may be weighted: the most recent edge may have weight 10 and an edge passed 9 hops ago weight 1. It is also possible to assign a lesser weight to all paths that have the preferred attribute or that go through weakly connected nodes. Store the last 10 nodes of each path in a hash table to avoid duplication. And keep somewhere the minimum sum of the last 9 edge lengths, along with the corresponding shortest path.
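A simplified sketch of this bounded breadth-first idea is below. It is not a full implementation of the description above: it assumes a dict-of-dicts graph (`edges`), integer color indices (`color`), an unweighted sum of edge lengths, no hash-table deduplication, and it starts from every node rather than only from the roots, since the question allows any start node. All names are hypothetical.

```python
import heapq


def beam_search(edges, color, length, beam_width):
    """Approximate the shortest path with `length` nodes, all colors distinct.

    edges: dict node -> {successor: distance}   (assumed representation)
    color: dict node -> small integer color index
    Returns (cost, path) or None if no surviving partial path can be completed.
    """
    # One partial path per possible start node: (cost, path, color bitmap).
    beam = [(0, (v,), 1 << color[v]) for v in color]
    for _ in range(length - 1):
        candidates = []
        for cost, path, mask in beam:
            for nxt, dist in edges.get(path[-1], {}).items():
                bit = 1 << color[nxt]
                if mask & bit:
                    continue  # color already used on this path
                candidates.append((cost + dist, path + (nxt,), mask | bit))
        # Keep only the `beam_width` cheapest partial paths for the next level.
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
        if not beam:
            return None
    best = min(beam, key=lambda c: c[0])
    return best[0], list(best[1])
```

Because partial paths are dropped at every level, the result is only a good path, not a guaranteed optimum; a larger `beam_width` trades time and memory for quality.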

Evgeny Kluev
  • This is an interesting strategy. I'll have to give this a try. Is there a name for it? Is this Branch-and-Bound? – Plutor Dec 14 '11 at 18:38
  • I don't know whether it is a classical algorithm or not. I have never seen this task before. – Evgeny Kluev Dec 14 '11 at 19:21
  • It is somewhat similar to Branch-and-Bound, but still different. – Evgeny Kluev Dec 14 '11 at 19:46
  • I think this is going to end up doing it for me. For shorter paths, it's pretty dang fast (it finds great 8-node paths in about 40 seconds). The trick is going to be finding a data structure that's fast at inserts, fast at extracting the shortest partial path, and can restrict to some _m_ elements. If it weren't for that last one, a heap would be perfect. – Plutor Dec 15 '11 at 12:42
  • Most natural is a ringbuffer for exactly m elements. One such ringbuffer for each path. – Evgeny Kluev Dec 15 '11 at 13:15
0

Have you tried a straightforward approach and failed? From your description of the problem, I see no reason why a simple greedy algorithm like depth-first search wouldn't be just fine:

  • Pick a start node.
  • Check the immediate neighbors, are there any nodes that are ok to append to the path? Expand the path with one of them and repeat the process for that node.
  • If you fail, backtrack to the last successful state and try a new neighbor.
  • If you run out of neighbors to check, this node cannot be the start node of a path. Try a new one.
  • If you have 10 nodes, you're done.

A good heuristic for picking a start node is hard to give without any knowledge of how the attributes are distributed, but it may be beneficial to try nodes with high degree first.
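The steps above can be sketched as a recursive search. This is a minimal illustration under assumed data structures (`edges` as a dict of successor dicts, `color` as a dict of node colors); it returns the first valid path it finds and makes no attempt to minimize distance.

```python
def find_path(edges, color, length):
    """Return the first valid path with `length` nodes found by DFS, or None.

    edges: dict node -> {successor: distance}   (assumed representation)
    color: dict node -> color                   (assumed representation)
    """
    def extend(path, used):
        if len(path) == length:
            return path
        for nxt in edges.get(path[-1], {}):
            if color[nxt] in used:
                continue  # this neighbor would repeat a color
            found = extend(path + [nxt], used | {color[nxt]})
            if found:
                return found
        return None  # no neighbor worked: backtrack

    # Heuristic from the text: try start nodes with many outgoing edges first.
    starts = sorted(color, key=lambda v: len(edges.get(v, {})), reverse=True)
    for start in starts:
        found = extend([start], {color[start]})
        if found:
            return found
    return None
```

Python's default recursion limit is far above a depth of 10 or 15, so plain recursion is adequate for paths of that length.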

Anders Lindahl
  • The problem, as I say, is that my data set is actually thousands of nodes with hundreds of thousands of edges. This algorithm would give me *one valid path*, but not necessarily the shortest. I'd have to enumerate all of the valid paths (or a large number of them), which could take.. erm.. a while. – Plutor Dec 09 '11 at 22:55
  • I've now tried a depth-first search, and it's going to take tens of thousands of CPU-hours. – Plutor Dec 11 '11 at 14:59
0

It looks like a greedy depth first search will be your best bet. With a reasonable distribution of attribute values, I think finding a single valid sequence is E[O(1)] time, that is expected constant time. I could probably prove that, but it might take some time. The proof would use the assumption that there is a non-zero probability that a valid next segment of the sequence could be found at every step.

The greedy search would backtrack whenever the unique attribute value constraint is violated. The search stops when a 15-segment path is found. If we accept my hunch that each sequence can be found in E[O(1)], then it is a matter of determining how many parallel searches to undertake.

Richard Povinelli
0

For those who want to experiment, here is a (postgres) sql script to generate some fake data.

SET search_path='tmp';

-- DROP TABLE nodes CASCADE;
CREATE TABLE nodes
    ( num INTEGER NOT NULL PRIMARY KEY
    , color INTEGER
    -- Redundant fields to flag {begin,end} of paths
    , is_root boolean DEFAULT false
    , is_terminal boolean DEFAULT false
    );

-- DROP TABLE edges CASCADE;
CREATE TABLE edges
    ( numfrom INTEGER NOT NULL REFERENCES nodes(num)
    , numto INTEGER NOT NULL REFERENCES nodes(num)
    , cost INTEGER NOT NULL DEFAULT 0
    );

-- Generate some nodes, set color randomly
INSERT INTO nodes (num)
SELECT n
FROM generate_series(1,2430) n
WHERE 1=1
    ;
UPDATE nodes SET COLOR= 1+TRUNC(20*random() );

-- (partial) cartesian product nodes*nodes. The ordering guarantees a DAG.
INSERT INTO edges(numfrom,numto,cost)
SELECT n1.num ,n2.num, 0
FROM nodes n1 ,nodes n2
WHERE n1.num < n2.num
AND random() < 0.029
    ;

UPDATE edges SET cost = 1+ 1000 * random();

ALTER TABLE edges
    ADD PRIMARY KEY (numfrom,numto)
    ;

ALTER TABLE edges
    ADD UNIQUE (numto,numfrom)
    ;

-- A root is a node with no incoming edges
UPDATE nodes no SET is_root = true
WHERE NOT EXISTS (
    SELECT * FROM edges ed
    WHERE ed.numto = no.num
    );
-- A terminal is a node with no outgoing edges
UPDATE nodes no SET is_terminal = true
WHERE NOT EXISTS (
    SELECT * FROM edges ed
    WHERE ed.numfrom = no.num
    );

SELECT COUNT(*) AS nnode FROM nodes;
SELECT COUNT(*) AS nedge FROM edges;
SELECT color, COUNT(*) AS cnt FROM nodes GROUP BY color ORDER BY color;

SELECT COUNT(*) AS nterm FROM nodes no WHERE is_terminal = true;

SELECT COUNT(*) AS nroot FROM nodes no WHERE is_root = true;

WITH zzz AS    (
    SELECT numto, COUNT(*) AS fanin
    FROM edges
    GROUP BY numto
    )
SELECT zzz.fanin , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanin
ORDER BY zzz.fanin
    ;

WITH zzz AS    (
    SELECT numfrom, COUNT(*) AS fanout
    FROM edges
    GROUP BY numfrom
    )
SELECT zzz.fanout , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanout
ORDER BY zzz.fanout
    ;

COPY nodes(num,color,is_root,is_terminal)
TO '/tmp/nodes.dmp';

COPY edges(numfrom,numto, cost)
TO '/tmp/edges.dmp';
wildplasser
0

The problem may be solved by dynamic programming as follows. Let's start by formally defining its solution.

Given a DAG G = (V, E), let C be the set of colors of the vertices visited so far, and let w[i, j] and c[i] be respectively the weight (distance) associated with edge (i, j) and the color of vertex i. Note that w[i, j] is taken to be infinite if the edge (i, j) does not belong to E, so that the minimization never selects a non-edge. Now define the distance d for going from vertex i to vertex j, taking C into account, as

d[i, j, C] = w[i, j] if i is not equal to j and c[j] does not belong to C

           = 0 if i = j

           = infinite if i is not equal to j and c[j] belongs to C

We are now ready to define our subproblems as follows:

A[i, j, k, C] = shortest path from i to j that uses exactly k edges and respects the colors in C so that no two vertices in the path are colored using the same color (one of the colors in C)

Let m be the maximum number of edges permitted in the path and assume that the vertices are labeled 1, 2, ..., n. Let P[i,j,k] be the predecessor vertex of j in the shortest path satisfying the constraints from i to j. The following algorithm solves the problem.

for k = 1 to m
  for i = 1 to n
    for j = 1 to n
      A[i,j,k,C] = min over x belonging to V {d[i,x,C] + A[x,j,k-1,C union c[x]]}
      P[i,j,k] = the vertex x that minimized A[i,j,k,C] in the previous statement

Set the initial conditions as follows:

A[i,j,k,C] = 0 for k = 0
A[i,j,k,C] = 0 if i is equal to j
A[i,j,k,C] = infinite in all of the other cases

The overall computational complexity of the algorithm is O(m n^3); since in your particular case m = 14 (you want exactly 15 nodes), m = O(1) and the complexity is actually O(n^3). To represent the set C, use a hash table so that insertion and membership testing take O(1) on average. Note that in the algorithm the operation C union c[x] is actually an insert operation that adds the color of vertex x to the hash table for C. Since you are inserting just one element, this gives exactly the same result as the set union (if the color is not in the set, it is added; otherwise the set does not change). Finally, to represent the DAG, use an adjacency matrix.

Once the algorithm is done, to find the shortest path over all possible vertices i and j, simply take the minimum among the values A[i,j,m,C]. If this value is infinite, no valid path exists. Otherwise, you can recover the path by using the P[i,j,k] values and tracing backwards through predecessor vertices. For instance, starting from a = P[i,j,m], the last edge on the shortest path is (a,j); the previous edge is given by b = P[i,a,m-1] and is (b,a), and so on.
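One way to make the role of the color set concrete is to put it in the table key. The sketch below is a variant of the dynamic program, not a transcription of the pseudocode above: it indexes subproblems by (end vertex, color bitmap) instead of (i, j, k, C), and it assumes a dict-of-dicts graph and integer color indices. All names are hypothetical.

```python
def exact_shortest(edges, color, length):
    """Exact shortest path with `length` nodes and pairwise distinct colors.

    edges: dict node -> {successor: distance}   (assumed representation)
    color: dict node -> small integer color index
    Returns (cost, path) or None if no valid path exists.
    """
    # layer[(v, mask)] = (best cost of a path ending at v that uses exactly
    # the colors in mask, key of the predecessor state in the previous layer)
    layer = {(v, 1 << color[v]): (0, None) for v in color}
    history = [layer]
    for _ in range(length - 1):
        nxt_layer = {}
        for (v, mask), (cost, _pred) in layer.items():
            for w, dist in edges.get(v, {}).items():
                bit = 1 << color[w]
                if mask & bit:
                    continue  # color of w already on the path
                key = (w, mask | bit)
                if key not in nxt_layer or cost + dist < nxt_layer[key][0]:
                    nxt_layer[key] = (cost + dist, (v, mask))
        layer = nxt_layer
        history.append(layer)
    if not layer:
        return None
    key = min(layer, key=lambda k: layer[k][0])
    total = layer[key][0]
    # Walk the predecessor links back through the layers to rebuild the path.
    path = []
    for lay in reversed(history):
        path.append(key[0])
        key = lay[key][1]
    return total, path[::-1]
```

With n vertices and c colors, a layer can hold up to n * 2^c states, so this is practical only when the number of reachable color sets stays small.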

Massimo Cafaro