
I have a very large data set, close to 500 million edges, and almost all of those edges need to be traversed. I'm trying to parallelize the traversals by paginating on the edge ID, which is an MD5 hash. I tried queries like the following:

g.E().hasLabel('foo').has(id, TextP.startingWith('AAA'))  // page 1
g.E().hasLabel('foo').has(id, TextP.startingWith('AAB'))  // page 2

But each query seems to be doing a full scan rather than touching only its subset. How do you recommend I go about pagination?

Omar Darwish

1 Answer


I suggest running the profile() step on your queries to see how much of the graph each one actually traverses.
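For example, a sketch using the question's own query: appending profile() reports, per step, the element counts and whether an index was used, which will show whether the id predicate is hitting an index or forcing a full scan.

```
// Hypothetical check: if the step containing the TextP predicate reports
// traverser counts near the full edge count, the prefix filter is being
// applied by scanning rather than by an index lookup.
g.E().hasLabel('foo').has(id, TextP.startingWith('AAA')).profile()
```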

Using the startingWith predicate on id doesn't look like an optimized approach to me, since IDs are typically backed by a hash index rather than a range index, so a prefix query cannot use the index and falls back to a scan. I would try prefixing on another string property, or even adding a random 'replica' property with values in [1..n] and filtering with .has('replica', i), to get the best performance, especially on such a large graph.
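A sketch of the read side of that scheme, assuming each 'foo' edge was tagged at write/load time with a random integer 'replica' property in [1..n] (here n = 8, and both the property name and n are made up for illustration). Worker i filters on its own bucket; since every edge carries exactly one replica value, the n workers together cover every edge exactly once with no overlap:

```
// Worker i (1 <= i <= 8) processes only its own slice of the edges.
// 'replica' is a hypothetical property assigned randomly at load time;
// an exact-match has() on it can use an ordinary composite/hash index.
g.E().hasLabel('foo').has('replica', i)
```

The trade-off is one extra property per edge and a one-time tagging pass, in exchange for equality filters that index cleanly, unlike prefix predicates on the ID.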

Kfir Dadosh