I have a graph dataset with a large number of relatively small, disjoint graphs. I need to find all vertices reachable from a set of start vertices that match certain search criteria. I use the following query:
FOR startnode IN nodes
  FILTER startnode._key IN [...set of values...]
  FOR node IN 0..100000 OUTBOUND startnode edges
    COLLECT k = node._key
    RETURN k
The query is very slow, even though it returns the correct result. This is because Arango actually ends up traversing the same subgraphs many times. For example, say there is the following subgraph:
a -> b -> c -> d -> e
When vertices a and c are both selected by the filter condition, Arango does two independent traversals, one starting from a and one from c. It visits vertices d and e during both of them, which wastes time. Adding the uniqueVertices option doesn't help, because vertex uniqueness is only checked within a single traversal, not across different traversals.
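To make the redundancy concrete, here is a minimal Python sketch of the toy subgraph above. It is not ArangoDB code and says nothing about how Arango is implemented internally; the names `EDGES`, `traverse`, `independent` and `shared` are made up for illustration. It only counts how often each vertex is visited when every start vertex gets its own visited set, compared with one visited set shared by all start vertices (which is the behaviour I am looking for).

```python
from collections import deque

# Toy subgraph from the example: a -> b -> c -> d -> e
EDGES = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}


def traverse(start, visited, counter):
    """Breadth-first traversal from `start`.

    Vertices already in `visited` are skipped; every vertex that is
    actually processed is counted in `counter`.
    """
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex in visited:
            continue
        visited.add(vertex)
        counter[vertex] = counter.get(vertex, 0) + 1
        queue.extend(EDGES[vertex])


def independent(starts):
    """One traversal per start vertex, each with a fresh visited set."""
    counter = {}
    for start in starts:
        traverse(start, set(), counter)
    return counter


def shared(starts):
    """One visited set shared by the traversals from all start vertices."""
    counter = {}
    visited = set()
    for start in starts:
        traverse(start, visited, counter)
    return counter


if __name__ == "__main__":
    print(independent(["a", "c"]))  # c, d and e are counted twice
    print(shared(["a", "c"]))       # every vertex is counted once
```

Both variants reach the same set of vertices, but with start vertices a and c the independent version does 8 vertex visits while the shared version does 5. On my real data the overlap between traversals is much larger than in this toy example.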
To confirm the impact on performance, I created an extra root document and added links from it to all the documents found by my filter:
FOR startnode IN nodes
  FILTER startnode._key IN [...set of values...]
  INSERT { _from: 'fakeVertices/0', _to: startnode._id } IN fakeEdges
Now the following query runs 4x faster than my original query, while producing the same result:
FOR node IN 1..1000000 OUTBOUND 'fakeVertices/0' edges, fakeEdges
  OPTIONS { uniqueVertices: 'global', bfs: true }
  COLLECT k = node._key
  RETURN k
Unfortunately, I cannot create a fake vertex and fake edges for each of my queries, because creating them takes even more time than the traversal itself.
My question is: does Arango provide a way to ensure uniqueness of visited vertices across all traversals in a given query? If not, is there a better way to solve the problem described above?