I have a gigantic directed graph (100M+ nodes), with multiple path instance records between sets of nodes. The path taken between any two nodes may vary, but what I'd like to find are paths that share multiple intermediary nodes except for a major deviation.

For example, I have 10 instances of a path between node A and node H. Nine of those ten path instances travel through nodes c,d,e,f - but one of the instances travels through c,d,z,e,f - I want to find that "odd" instance.

Any ideas how I would even begin to approach such a problem? Existing analytical frameworks that might be suited to the task?

Details based on comments:

  • A PIR (path instance record) is a list of nodes traveled through with associated edge traversal times per edge.
  • Currently, raw PIR records are in a plain string format - obviously, I would want to store it differently based on how I eventually choose to analyze it.
  • This is not a route solving problem - I never need to find all possible paths; I only need to analyze taken paths (each of which is a PIR).
  • The list of subpaths needs to be generated from the PIRs.

An example of a PIR would be something like: nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF

This translates to the path A->B->C->D->F; each number is the cost/time of the edge between the two nodes around it. For instance, it cost 300 to go from A->B, 600 to go from B->C, and 100 to go from D->F. The cost/time of each traversal will differ each time the traversal is made. So, for instance, in one PIR it may cost 100 to go from A->B, but in the next it may cost 150 to go from A->B.
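As a rough illustration of the format described above, a PIR string could be split into its node list and its per-edge costs like this. The function name `parse_pir` is made up for this sketch, and it assumes every cost is an integer and that fields are always separated by `;`.

```python
from typing import List, Tuple


def parse_pir(record: str) -> Tuple[List[str], List[int]]:
    """Split 'node;cost;node;cost;...;node' into (nodes, costs).

    Assumption: costs are integers and the record alternates node, cost, node.
    """
    fields = record.strip().split(";")
    if len(fields) % 2 == 0:
        raise ValueError("expected an odd number of fields: node;cost;...;node")
    nodes = fields[0::2]                     # every even position is a node
    costs = [int(c) for c in fields[1::2]]   # every odd position is an edge cost
    return nodes, costs


nodes, costs = parse_pir("nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF")
# costs[i] is the cost of the edge nodes[i] -> nodes[i + 1]
```

Since the deviation I'm after is about which nodes are visited, the `nodes` list is probably the part that matters most; the costs could be kept alongside it if they turn out to be relevant.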

Loki
  • [Longest Common Subsequence][1] should give you the common nodes between two paths. Not sure about how to mine them in an intelligent way after that. [1]: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem – BiGYaN May 10 '13 at 10:05
  • I've posted an answer below that should work for most cases, but "odd" isn't very well defined. However, it sounds like the main thing is that you are looking for cases where a sequence is sufficiently common, and there are one or more paths that have a subsequence that is a small edit distance away from it. The task then is just to define exactly how common, and how small the edit distance, as well as which edit distance metric to use. – Nuclearman May 10 '13 at 12:20
  • Please edit the question with answers to the following: Are nodes of interest (e.g., A and H) given for each problem instance, or does the algorithm need to seek out all pairs of nodes that have deviating paths? Solve it once only? Letting PIR = "path instance record", is a PIR just a list of nodes? Are PIRs stored in random order in a sequentially-accessed file, or in a DB, or what? Given a pair of node names X,Y, how do you get the list of all paths between X and Y? How do you get a list of all paths that go through node C, or of all paths that go through one or more of C,D...Z, or through all of C,D...Z? – James Waldby - jwpat7 May 10 '13 at 14:30
  • What do you mean by "...with associated edge traversal times per edge."? Could you give an example or two of this (with the PIR). – Nuclearman May 11 '13 at 21:07
  • Hmmm, seems like the cost doesn't actually have an effect on what you are looking for, unless the definition of "major deviation" is based on the cost, in which case my suggestion of edit distance may be insufficient. Otherwise, edit distance seems to hold as valid. Also, how many paths do you have? – Nuclearman May 13 '13 at 08:43
  • I don't know how many unique paths there are, actually. – Loki May 14 '13 at 18:25
  • I suppose it isn't too important unless you have a very large number of them or the ones you do have are very long (100,000s or worse), but it's probably a minor point as I'm not sure you can get an algorithm that is much more efficient. The algorithm below takes roughly O(L^2) per path, where L is the length of the path. – Nuclearman May 17 '13 at 13:34

1 Answer

Go through the list of paths and break them up into sets based on the start and end node, so that, for example, all paths that start with node A and end with node B are in the same set. Then do the same thing with subsequences of those paths, so that, for example, every path with the subsequence a,b,c,d, the start node y, and the end node k is in the same set. Reverse paths as required so that you don't have one set for paths from k to y and another for paths from y to k.

You can then check whether a subsequence is common enough. If it is, take the path(s) that don't contain that subsequence and check whether each has a subsequence that is sufficiently close to the common one, based on edit distance. If you are just interested in the path as a whole, you can simply calculate the edit distance between the path and the subsequence, subtract the difference in length, and check whether the result is low enough. It's probably best to compare against a subsequence of the path that starts and ends with the same nodes as the desired subsequence.
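A minimal sketch of the grouping step might look like the following. All names here (`group_by_endpoints`, `contiguous_subsequences`, `subsequence_support`, `min_len`) are hypothetical. It assumes each path is already a tuple of node names, treats a "subsequence" as a contiguous run of nodes, and leaves out the path-reversal step for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Path = Tuple[str, ...]


def group_by_endpoints(paths: Sequence[Path]) -> Dict[Tuple[str, str], List[Path]]:
    """Put every path into the set keyed by its (start node, end node)."""
    groups: Dict[Tuple[str, str], List[Path]] = defaultdict(list)
    for p in paths:
        groups[(p[0], p[-1])].append(p)
    return dict(groups)


def contiguous_subsequences(path: Path, min_len: int) -> Set[Path]:
    """All contiguous runs of at least min_len nodes; O(L^2) of them per path."""
    n = len(path)
    return {path[i:j] for i in range(n) for j in range(i + min_len, n + 1)}


def subsequence_support(paths: Sequence[Path], min_len: int) -> Dict[Path, Set[int]]:
    """Map each contiguous subsequence to the indices of the paths containing it."""
    support: Dict[Path, Set[int]] = defaultdict(set)
    for idx, p in enumerate(paths):
        for sub in contiguous_subsequences(p, min_len):
            support[sub].add(idx)
    return dict(support)
```

Running `subsequence_support` on one endpoint group at a time gives, for each subsequence, how many paths in that group share it, which is the count the "common enough" check needs. With 100M+ nodes and long paths, the full dictionary could get very large, so in practice it would likely need a cap on subsequence length or some form of pruning.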

For your example, the algorithm would eventually reach the set of paths containing the subsequence c,d,e,f and find that there are 9 of them. This exceeds the count required for the subsequence to be common enough (and long enough; you probably want sequences of at least some length k), so it would then check the paths that are not included. In this case, there is only one. It would then note, either directly or indirectly, that removing z alone would turn the sequence c,d,z,e,f into c,d,e,f. This passes the (currently vague) requirements for "odd", and thus the path containing c,d,z,e,f is added to the list of paths to be returned.
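To make the example concrete, here is a rough sketch of the "odd path" check for a single common subsequence. The function names and the `min_support` and `max_edits` thresholds are placeholders, since "common enough" and "close enough" are still undefined. It uses plain Levenshtein distance and compares against the window of the path that runs from the first occurrence of the common subsequence's first node to the last occurrence of its last node.

```python
from typing import List, Sequence, Tuple

Path = Tuple[str, ...]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def contains(path: Path, sub: Path) -> bool:
    """True if sub occurs in path as a contiguous run."""
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))


def find_odd_paths(paths: Sequence[Path], common: Path,
                   min_support: int, max_edits: int) -> List[Path]:
    """Return paths lacking `common` but containing a window close to it."""
    if sum(contains(p, common) for p in paths) < min_support:
        return []  # the subsequence is not common enough to define "normal"
    odd = []
    for p in paths:
        if contains(p, common) or common[0] not in p or common[-1] not in p:
            continue
        start = p.index(common[0])
        end = len(p) - 1 - p[::-1].index(common[-1])  # last occurrence
        if start <= end and edit_distance(p[start:end + 1], common) <= max_edits:
            odd.append(p)
    return odd


normal = ("A", "c", "d", "e", "f", "H")
deviant = ("A", "c", "d", "z", "e", "f", "H")
result = find_odd_paths([normal] * 9 + [deviant], ("c", "d", "e", "f"),
                        min_support=9, max_edits=1)
```

Here `result` holds only the path through z, because its window c,d,z,e,f is one deletion away from c,d,e,f. A different edit distance metric could be swapped in for `edit_distance` without changing the rest.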

Nuclearman
  • Sounds like you have a good grasp of what I'm trying to do - any suggestions on existing analytical frameworks that might get me started? – Loki May 10 '13 at 15:10
  • I'm not sure of any frameworks offhand, but anything that can break up a path (as a sequence of nodes) into all subsequences and do the edit distance calculation should work. The main issue is that the graph is rather large and you may have a similarly large number of paths (or even just really long paths). I'm waiting on an answer to the above comment before I say more, though, as it may be more complicated than I thought. – Nuclearman May 11 '13 at 21:10
  • This is specialized enough that you may just have to write the algorithm yourself: a function for breaking up the paths by start and end, a function for breaking up a path into subsequences, a function for grouping subsequences within a set of paths, and finally a function for searching for odd paths. – Nuclearman May 13 '13 at 08:49