
I am trying to solve a performance issue with a traversal and have tracked it down to the order().by() step. It seems that order().by() greatly increases the number of statement index ops required (per the profiler) and dramatically slows down execution.

A non-ordered traversal is very fast:

g.V().hasLabel("post").limit(40)

execution time: 2 ms

index ops: 1

Adding a single ordering step adds thousands of index ops and makes the traversal run much slower:

g.V().hasLabel("post").order().by("createdDate", desc).limit(40)

execution time: 62 ms

index ops: 3909

Adding a single filtering step adds thousands more index ops and runs even slower:

g.V().hasLabel("post").has("isActive", true).order().by("createdDate", desc).limit(40)

execution time: 113 ms

index ops: 7575

However, the same filtered traversal without ordering runs just as fast as the original unfiltered traversal:

g.V().hasLabel("post").has("isActive", true).limit(40)

execution time: 1 ms

index ops: 49

By the time we build out the actual traversal we run in production, there are around 12 filtering steps and 4 by() step-modulators, and the traversal takes over 6000 ms to complete with over 33000 index ops. Removing the order().by() steps makes the same traversal run fast (500 ms).

The issue seems to be with order().by() and the number of index ops required to sort. I have seen the performance issue noted here, but adding barrier() did not help. The traversal is also fully optimized, requiring no TinkerPop conversion.

I am running engine version 1.1.0.0 R1. There are about 5000 post vertices.

How can I improve the performance of this traversal?

Fook

1 Answer


Generally, the only way to improve the performance of ordering in any graph database (including Neptune) is to filter the items down to a minimal set before ordering them.

An order().by() step requires every element that matches the criteria to be retrieved before any of them can be returned in the specified order. With only a limit(N) step, the traversal terminates as soon as N items have been returned. This is why you are seeing significantly faster times for the limit()-only traversals: they process and return less data, since they return the first N records found, which may be in any order.
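To make the difference concrete, here is a minimal sketch in plain Python (not Gremlin, and not how Neptune is implemented) that counts how many records each kind of query has to pull before it can answer. The data set, the `counted` helper, the integer `createdDate` values and the `cutoff` window are all assumptions made up for illustration. Only the figure of about 5000 posts and the limit of 40 come from the question.

```python
from itertools import islice


def counted(records, stats):
    """Yield records one at a time, counting how many were actually pulled."""
    for record in records:
        stats["pulled"] += 1
        yield record


# Hypothetical data: 5000 posts whose createdDate is a unique integer 0..4999,
# stored in no particular date order (7919 is coprime with 5000).
posts = [{"id": i, "createdDate": (i * 7919) % 5000} for i in range(5000)]

# 1) limit(40) only: stop as soon as 40 records have been seen.
limit_stats = {"pulled": 0}
first_40 = list(islice(counted(posts, limit_stats), 40))

# 2) order().by(createdDate, desc).limit(40): every record must be pulled
#    before the first sorted result can be emitted.
order_stats = {"pulled": 0}
newest_40 = sorted(
    counted(posts, order_stats),
    key=lambda p: p["createdDate"],
    reverse=True,
)[:40]

# 3) Filter down first, then order: only records inside a (hypothetical)
#    recent window reach the sort. This assumes the window holds >= 40 posts.
cutoff = 4900
narrow_stats = {"pulled": 0}
recent = (p for p in posts if p["createdDate"] >= cutoff)
newest_40_narrowed = sorted(
    counted(recent, narrow_stats),
    key=lambda p: p["createdDate"],
    reverse=True,
)[:40]

print(limit_stats["pulled"], order_stats["pulled"], narrow_stats["pulled"])
```

In the traversal itself, the equivalent of the third case would be an extra selective has() filter placed before order().by(), for example a lower bound on createdDate. Whether that is acceptable depends on whether the window is guaranteed to contain at least the 40 results you need.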

bechbd