I have a gigantic directed graph (100M+ nodes), with multiple path instance records between sets of nodes. The path taken between any two nodes may vary, but what I'd like to find are paths that share multiple intermediary nodes except for a major deviation.
For example, I have 10 instances of a path between node A and node H. Nine of those ten path instances travel through nodes c,d,e,f - but one of the instances travels through c,d,z,e,f - I want to find that "odd" instance.
Any ideas how I would even begin to approach such a problem? Are there existing analytical frameworks that might be suited to the task?
Details based on comments:
- A PIR (path instance record) is a list of the nodes traveled through, with an associated traversal time for each edge.
- Currently, raw PIRs are in a plain string format - obviously, I would want to store them differently based on how I eventually choose to analyze them.
- This is not a route solving problem - I never need to find all possible paths; I only need to analyze taken paths (each of which is a PIR).
- The list of subpaths needs to be generated from the PIRs.
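As a rough sketch of that last point, assuming a "subpath" means a contiguous run of nodes within one PIR (the function name `subpaths` and the window length are hypothetical, chosen only for illustration), subpaths could be generated and counted like this:

```python
from collections import Counter
from typing import Iterator, List, Tuple

def subpaths(nodes: List[str], length: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of `length` nodes from one path."""
    for i in range(len(nodes) - length + 1):
        yield tuple(nodes[i:i + length])

# Three made-up path instances between A and H; the last one detours via z.
paths = [
    ["A", "c", "d", "e", "f", "H"],
    ["A", "c", "d", "e", "f", "H"],
    ["A", "c", "d", "z", "e", "f", "H"],
]

# Count how often each 3-node subpath occurs across all instances.
counts = Counter(sp for p in paths for sp in subpaths(p, 3))
```

Subpaths with a low count, such as `("c", "d", "z")` here, would be candidates for the "odd" instances. This is only meant to pin down the terminology; it says nothing about how to do this at 100M+ node scale.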
An example of a PIR would be something like: nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF
This translates to the path A->B->C->D->F; each number is the cost/time of the edge between the two nodes around it. For instance, it cost 300 to go from A->B, 600 to go from B->C, and 100 to go from D->F. The cost/time of each traversal will differ each time the traversal is made. So, for instance, in one PIR it may cost 100 to go from A->B, but in the next it may cost 150.
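To make the format concrete, here is a minimal parsing sketch. It assumes the record strictly alternates node, cost, node, and that costs are integers; the name `parse_pir` is hypothetical and not part of any existing tooling.

```python
from typing import List, Tuple

def parse_pir(record: str) -> Tuple[List[str], List[int]]:
    """Split a raw PIR string into its node list and per-edge costs."""
    fields = record.strip().split(";")
    # A well-formed record alternates node, cost, node, ..., node,
    # so it always has an odd number of fields.
    if len(fields) % 2 == 0:
        raise ValueError(f"malformed PIR: {record!r}")
    nodes = fields[0::2]                      # even positions are nodes
    costs = [int(c) for c in fields[1::2]]    # odd positions are edge costs (assumed integers)
    return nodes, costs

nodes, costs = parse_pir("nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF")

# Pair each edge (source, target) with the cost recorded for that traversal.
edges = list(zip(nodes, nodes[1:], costs))
```

Keeping the node sequence separate from the costs means the sequences can be compared across PIRs on their own, while the costs, which vary per traversal, stay available for later analysis.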