I have 3 nodes in DataStax Enterprise and have loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query on my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect. When I run a query such as `g.V().count()`, CPU usage and load average barely change, while if I run a CQL query it is distributed across all nodes and CPU usage and load average change significantly on every node. What are the best practices or configurations for efficient Gremlin queries in this case?
-
Please post more detail - query code etc. – Peter May 15 '18 at 08:58
-
for example " g.V().count() " or any other query – ahmad May 15 '18 at 10:11
-
again...more details. what version of DSE Graph? you say "any other query", but you're only going to get answers related to `g.V().count()` - is that what you want? what kind of speed are you getting now from that traversal? – stephen mallette May 15 '18 at 10:39
-
DSE Graph version is 5.1. With only 100,000 vertices and edges, the query `g.E().count()` takes 50 seconds and `g.V().count()` takes 10 seconds. With 65 million vertices, `g.V().count()` fails with an error. – ahmad May 15 '18 at 13:38
2 Answers
`count()`-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you are using standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
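As a rough illustration, here is how such a count can be routed through the OLAP (Spark) traversal source from the Gremlin Console in DSE Graph 5.1; this is a sketch, and `mygraph` is just a placeholder for your own graph name:

```
// Gremlin Console attached to DSE Graph (sketch; "mygraph" is a placeholder graph name)

// OLTP traversal source: real-time, index-driven queries; full scans here are slow
:remote config alias g mygraph.g

// OLAP traversal source: traversals are executed as Spark jobs across the cluster
:remote config alias g mygraph.a
g.V().count()   // now runs via the graph computer, scanning vertices in parallel
```

The point of the OLAP source is that the scan is distributed across all nodes of the cluster, much like the CQL behaviour described in the question, instead of streaming every vertex back through a single coordinator.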

After many tests with different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices, whereas if you define an index on a vertex property and look up a specific vertex, for example `g.V().hasLabel('member').has('C_ID','4242833')`, the query takes less than 1 second, which is acceptable. The question is: why does Gremlin have a problem with count queries over millions of vertices?
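For reference, the fast point lookup above relies on a property index. Below is a minimal sketch of a DSE Graph 5.1 schema definition, assuming the `member` label and `C_ID` property from the example query (the index name is arbitrary):

```
// DSE Graph schema sketch: property key, vertex label, and a materialized index on C_ID
schema.propertyKey('C_ID').Text().single().ifNotExists().create()
schema.vertexLabel('member').properties('C_ID').ifNotExists().create()
schema.vertexLabel('member').index('memberByCId').materialized().by('C_ID').add()

// Indexed lookup (fast, reads only the matching partition):
g.V().hasLabel('member').has('C_ID', '4242833')

// Count (slow in OLTP, must touch every vertex on every node):
g.V().count()
```

The materialized index turns the equality lookup into a direct partition read in Cassandra, whereas `count()` has no index to lean on and still has to visit every vertex, which is why only the OLAP route scales for it.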
-
This is not related to Gremlin per se, but to the nature of the `count()` query, which ultimately translates to actions that must be done in the physical world. A `count()`, unless optimized by the database via some kind of counter, would translate roughly into a full linear scan of the underlying storage device / hard drive (assuming a single node cluster). Things get worse in a distributed system, since a `count()` also requires a scan of all nodes in your Cassandra cluster. As Stephen said, this is easily done using Gremlin OLAP though. You may have to maintain your own counter. – jbmusso Oct 30 '18 at 10:04