0

I am currently running a cluster of 3 nodes with 200 mill of data and the specific vertex I'm querying a total of 25 mill vertex and 30 Mill edges. I am running the following query

g.V().hasLabel('people_node').has("age", inside(0,25)).filter(outE('posted_question').count().is(gt(1))).profile()

I have tried this query on a smaller set of ~100 vertex and edges and the profiler showed that indexes have been used for all parts of the query. However, I think the problem might be in my schema which is shown below.

Schema

schema.propertyKey('id').Text().ifNotExists().create()
schema.propertyKey('name').Text().ifNotExists().create()
schema.propertyKey('age').Int().ifNotExists().create()
schema.propertyKey('location').Point().withGeoBounds().ifNotExists().create()
schema.propertyKey('gender').Text().ifNotExists().create()
schema.propertyKey('dob').Timestamp().ifNotExists().create()

schema.propertyKey('tags').Text().ifNotExists().create()
schema.propertyKey('date_posted').Timestamp().ifNotExists().create()

schema.vertexLabel('people_node').properties('id','name','location','gender','dob').create()
schema.vertexLabel('questions_node').properties('id','tags','date_posted').create()
schema.edgeLabel('posted_question').single().connection('people_node','questions_node').create()

Indexes Used

schema.vertexLabel("people_node").index("search").search().by("name").by("age").by("gender").by("location").by("dob").ifNotExists().add()
schema.vertexLabel("people_node").index("people_node_index").materialized().by("id").ifNotExists().add()

schema.vertexLabel("questions_node").index("search").search().by("date_posted").by("tags").ifNotExists().add()
schema.vertexLabel("questions_node").index("questions_node_index").materialized().by("id").ifNotExists().add()

I have also read about "OLAP" queries I believe I have activated it but the query is still way too slow. Any advise or insight on what is slowing it down will be greatly appreciated.

Profile Statement (OLTP)

gremlin> g1.V().has("people_node","age", inside(0,25)).filter(outE('posted_question').count().is(gt(1))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers
     Time (ms)    % Dur
=============================================================================================================
DsegGraphStep(vertex,[],(age < 25 & age > 0 & l...                     1           1
        38.310    25.54
  query-optimizer
         0.219
    \_condition=((age < 25 & age > 0 & label = people_node) & (true))
  query-setup
         0.001
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  index-query
        26.581
    \_indexType=Search
    \_usesCache=false
    \_statement=SELECT "community_id", "member_id" FROM "MiniGraph"."people_node_p" WHERE "solr_query" = '{"q
                ":"*:*", "fq":["age:{0 TO 25}"]}' LIMIT ?; with params (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Option
              al.empty, pagingState=null, pageSize=-1, user=Optional[cassandra], waitForSchemaAgreement=true,
               async=true}
TraversalFilterStep([DsegVertexStep(OUT,[posted...
       111.471    74.32
  DsegVertexStep(OUT,[posted_question],edge,(di...                     1           1
        42.814
    query-optimizer
         0.227
    \_condition=((direction = OUT & label = posted_question) & (true))
    query-setup
         0.036
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
    vertex-query
        29.908
    \_usesCache=false
    \_statement=SELECT * FROM "MiniGraph"."people_node_e" WHERE "community_id" = ? AND "member_id" = ? AND "
                 ~~edge_label_id" = ? LIMIT ? ALLOW FILTERING; with params (java.lang.Integer) 1300987392, (j
                 ava.lang.Long) 1026, (java.lang.Integer) 65584, (java.lang.Integer) 2
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Optio
               nal.empty, pagingState=null, pageSize=-1, user=Optional[cassandra], waitForSchemaAgreement=tru
               e, async=true}
    \_usesIndex=false
  RangeGlobalStep(0,2)                                                 1           1
         0.097
  CountGlobalStep                                                      1           1
         0.050
  IsStep(gt(1))
        68.209
DsegPropertyLoadStep
         0.205     0.14
                                            >TOTAL                     -           -
       149.986        -

Next, due to the partial query being much faster I assume the long time consumption is due to the necessary graph traversals. Hence, is it possible to cache or activate the indexes (_usesIndex=false) so that OLAP queries to be much faster?

WhiteSolstice
  • 641
  • 2
  • 7
  • 20

1 Answers1

1

Will you please post the output of the .profile statement?

Semanticaly, it looks like you're trying to find all "people" under the age of 25 that have more than 1 posted question. Is that accurate?

jlacefie
  • 614
  • 3
  • 5
  • I have updated the post and added the profile statement, for now this is done on a much smaller sample size of ~90 nodes in total and yes I am trying to search for all people under age of 25 which have posted more then 1 question.@Jonathan Lacefield – WhiteSolstice May 14 '19 at 05:17
  • With my personal experience in datastax graph, search or filter operation even in the medium set of data will results in the performance degradation. So its better to redirect these kind of operations to Solr queries. – ashoksl May 28 '19 at 04:15