Gremlin order() step not natively supported?

Question

I'm running into a warning when profiling a slow Gremlin traversal.

WARNING: >> OrderGlobalStep([[[CoalesceStep([[VertexStep(IN,[view],edge), ProfileStep, NeptuneHasStep([isActive.eq(true)]), ProfileStep, EdgeVertexStep(OUT), ProfileStep, NeptuneHasStep([~id.eq(63b944e2-481d-42c8-a1a3-c0bc3ad24484)]), ProfileStep, RangeGlobalStep(0,1), ProfileStep, CountGlobalStep, ProfileStep], [ConstantStep(0), ProfileStep]]), ProfileStep], asc], [value(rekognitionModerationDate), desc], [value(createdDate), desc]]) << (or one of its children) is not supported natively yet

The profiler reports that this step takes up 62% of the total execution time, so I'd like to optimize it. Here is a simplified version of the complete traversal:

g.V()
.hasLabel("post")
.order()
.by(
  __.coalesce(
    __.inE("view")
    .has("isActive", true)
    .outV()
    .hasId(userId)
    .limit(1)
    .count(),
    __.constant(0)
  ),
  order.asc
)

The goal is to output first the post vertices that do not have an incoming view edge from the requesting user. In other words, show posts the requesting user hasn't viewed, followed by posts they have viewed. The current traversal works but is very slow. How can I refactor this to be 'native' so it will execute faster?

Edit: Apparently the problem is that Neptune doesn't have native support for order().by() with a custom comparator as explained here: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-step-support.html

I am still interested in ideas of how to refactor this for pure native support.

Kelvin Lawrence · Answer 1 · 2021-07-25T15:15:12.883

The current Amazon Neptune query engine will, in general, optimize order ... by steps. However, if any of the child traversals associated with the order cannot be optimized, the entire step will not be optimized. As you noticed in the documentation, there are limitations today on what a by modulator can contain when used with order if the step is to be optimized.

Also worth noting are the conditions under which a coalesce step will not get optimized. The query optimizer is quite good at optimizing coalesce steps, but there is one case where today it does not: when the LHS and RHS of the coalesce yield different types of value, or when a constant is used. So if a coalesce always yields, for example, a vertex from each possible path, it will likely get optimized. When the RHS is a constant, however, that often causes the coalesce to not get optimized.

You can observe this with a query such as

g.V('3').coalesce(out().count(),constant(0))

as the result from a CountGlobalStep is not the same type as the result from a ConstantStep. This does not always mean you will see bad performance, but it is the reason why, in this case, you are seeing the warning in the profile. In general, when a constant is used with coalesce you will see the warning with the current version of the engine. As with many things, these are point-in-time behaviors.
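For contrast, here is a sketch of a coalesce whose branches both yield vertices, which is the shape described above as more likely to be optimized. It is only an illustration: the 'route' edge label is borrowed from the air-routes data set, and whether the warning actually disappears depends on the engine version, so the profile output is the thing to check.

```groovy
// Illustration only, not verified against a specific Neptune engine version.
// Both branches emit vertices, so the coalesce yields one type of value
// and no constant is involved.
g.V('3').coalesce(out('route'), in('route'))
```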

In your specific case, however, I think we can simplify things and potentially get the query optimized. Because you are using count, the count will be 0 if no paths exist, without the need for the pesky coalesce. Here is an air-routes example that gets optimized.

g.V().hasLabel('airport').
  order().
    by(in('route').count()).
  limit(10).
  project('code','count').
    by('code').
    by(in('route').count())

which yields

1   {'count': 0, 'code': 'BVS'}
2   {'count': 0, 'code': 'TWB'}
3   {'count': 0, 'code': 'EKA'}
4   {'count': 0, 'code': 'TKQ'}
5   {'count': 0, 'code': 'ISL'}
6   {'count': 0, 'code': 'RIG'}
7   {'count': 0, 'code': 'INT'}
8   {'count': 0, 'code': 'APA'}
9   {'count': 0, 'code': 'BWU'}
10  {'count': 0, 'code': 'BID'}
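
Applied to the traversal in the question, the same idea might look like the sketch below. It assumes userId is bound as in the question and has not been run against Neptune. Because count() always emits a value (0 when nothing matches), the constant(0) branch of the original coalesce is never reached, so dropping the coalesce leaves the sort key unchanged. The explicit ascending order is also omitted, since ascending is the default for by().

```groovy
// Sketch only: same sort key as the question's traversal, minus coalesce.
// count() emits 0 when the user has no active view edge to the post,
// so unviewed posts (key 0) sort ahead of viewed posts (key 1).
g.V().
  hasLabel("post").
  order().
    by(
      __.inE("view").
         has("isActive", true).
         outV().
         hasId(userId).
         limit(1).
         count()
    )
```

Profiling the traversal again should show whether the OrderGlobalStep warning is gone for this form.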

What about in the case of `.fold().coalesce(__.unfold().values("createdDate"), __.constant(""))`? Will a `values` step and a `constant` step return the same type? Do you have a recommendation for optimizing that case? — Fook, Aug 05 '21 at 03:03
