
I am currently using JanusGraph version 0.5.2. I have a graph with about 18 million vertices and 25 million edges.

I have two versions of this graph: one backed by a 3 node Cassandra cluster and another backed by a 6 node Cassandra cluster (both with a replication factor of 3).

I am running the query below on both of them:

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').next()
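For readability, here is the same traversal broken out step by step (identical logic, comments added):

```groovy
g.V().hasLabel('label_A').                   // start from vertices labelled label_A
  has('some_id', 123).                       // ...matching the composite index keys
  has('data.name', 'value1').                // source vertex: data.name == 'value1'
  repeat(both('sample_edge').simplePath()).  // walk sample_edge both ways, no revisits
  until(has('data.name', 'value2')).         // stop when the target vertex is found
  path().by('data.name').                    // emit the path as data.name values
  next()
```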

The issue is that this query takes ~130ms on the 3 node cluster whereas it takes ~400ms on the 6 node cluster.

I have benchmarked around ten queries and this is the only one where there is a significant difference in performance between the two clusters.

I have tried running .profile() on both versions and the outputs are almost identical in terms of the steps and time taken:

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').limit(1).profile()

==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(label_A), o...                     1           1           4.582     0.39
    \_condition=(~label = label_A AND some_id = 123 AND data.name = value1)
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[1]@8000
    \_index=someVertexByNameComposite
  optimization                                                                                 0.028
  optimization                                                                                 0.907
  backend-query                                                        1                       3.012
    \_query=someVertexByNameComposite:multiKSQ[1]@8000
    \_limit=8000
RepeatStep([JanusGraphVertexStep(BOTH,[...                     2           2        1167.493    99.45
  HasStep([data.name.eq(...                                                          803.247
  JanusGraphVertexStep(BOTH,[...                           12934       12934         334.095
    \_condition=type[sample_edge]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
    \_multi=true
    \_vertices=264
    optimization                                                                               0.073
    backend-query                                                    266                       5.640
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
    optimization                                                                               0.028
    backend-query                                                  12689                     312.544
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
  PathFilterStep(simple)                                           12441       12441          10.980
  JanusGraphMultiQueryStep(RepeatEndStep)                           1187        1187          11.825
  RepeatEndStep                                                        2           2         810.468
RangeGlobalStep(0,1)                                                   1           1           0.419     0.04
PathStep([value(data.name)])                                 1           1           1.474     0.13
                                            >TOTAL                     -           -        1173.969        -

NOTE: You may have noticed that the profile above reports a total time of >1000ms, well above the ~130ms/~400ms timings quoted earlier. I believe this is a separate issue unrelated to this post.
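As a side note, since the RepeatStep accounts for ~99% of the time, a depth-bounded variant of the traversal could help rule out unbounded fan-out when comparing the two clusters. This is a sketch only: the 5-hop cap is a hypothetical limit (it uses the standard Gremlin `loops()` predicate), and the appropriate depth depends on the data.

```groovy
// Same traversal, with the repeat capped at a hypothetical 5 hops so that
// deep exploration cannot dominate the comparison between clusters.
g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').
  repeat(both('sample_edge').simplePath()).
  until(has('data.name', 'value2').or().loops().is(gte(5))).  // stop on match or at depth 5
  has('data.name', 'value2').                                 // keep only genuine matches
  path().by('data.name').next()
```

The post-repeat `has()` is needed because the `until()` also releases traversers that hit the depth cap without matching.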

Some other points that might be helpful:

  • The 3 and 6 node clusters are identical in terms of hardware
  • We aren't running JanusGraph in embedded mode (where it is colocated with Cassandra); instead, it runs separately on its own server nodes
  • As mentioned earlier, the slowness is only observed for the path query. For instance, here's another traversal where we observe the same latency across the 3 and 6 node clusters: g.V().hasLabel('label_B').has('some_id', 123).has('data.name', 1234567).both('sample_edge').valueMap('data.field1', 'data.field2').next(10)

I'd really appreciate any input on figuring out why the query is 3x slower on 6 nodes.

Happy to provide more information as required!

Thank you!

VarunG