High memory consumption of gremlin query on AWS Neptune

Question

I'm trying to parallelize a query that counts the vertices having an edge of a given label in AWS Neptune. The graph is large, so I partition the vertices by ID prefix:

g.V().hasLabel('Label').        
      hasId(startingWith('prefix')).
      where(outE('EdgeType')).
      count()

However, the query seems to consume a lot of memory and I run into OOM exceptions. Is there an explanation for this? And what, in general, would be a good strategy to parallelize and run such a query efficiently?

The graph has about 500M vertices, of which ~100M have the label of interest, and ~90% of those have an edge with the desired label.

`startingWith` can be expensive. How many vertices are there in the graph? Also, how many of them have that edge type? — Kelvin Lawrence, Aug 10 '22 at 22:05
@KelvinLawrence Thanks for taking a look, I added the graph size to the question. Also would be interested in a good strategy for such queries in general. — user1587520, Aug 11 '22 at 07:52
@KelvinLawrence Looking at the monitoring, this is also the case if I run it as a single query without the filter on the vertex ID: it seems to consume a lot of memory, though I don't get an OOM then (likely because only a single query is running on the machine in that case). — user1587520, Aug 11 '22 at 10:45

score 1 · Accepted Answer · answered Aug 11 '22 at 13:51

Any of the text predicates (startingWith(), endingWith(), containing()) are non-indexed operations in Neptune, as Neptune does not maintain a native full-text-search index. This means that any query using them may need to perform a full or range "scan" of the graph to find the results, and this can be expensive (as Kelvin mentions). You can, however, leverage the integration with OpenSearch [1] if these types of queries are common in your use case.

Also note that Neptune's execution model is presently designed for highly concurrent, transactional queries. If you have queries that are going to touch a large portion of the dataset, you may need to break those queries up into multiple, parallel queries. Each Neptune instance has a number of query execution threads equal to 2x the number of vCPUs on the instance. Memory allocation is divided up such that a large portion of instance memory is reserved for the buffer pool cache, and the remaining memory is allocated to the OS of the instance and to the execution threads. Each execution thread uses a portion of the memory allocated for those threads. So an out-of-memory exception occurs when an individual thread runs out of its share of memory, not when the instance as a whole runs out. If the query execution is running out of memory, you can try increasing the instance size (to allocate more memory to the threads), or you may need to divide the query into multiple parts and execute them concurrently.
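As a rough illustration of the fan-out described above, here is a minimal Python sketch that runs several smaller count queries concurrently and sums the partial results. It is only a sketch under assumptions: `run_query` is a hypothetical callable that you would supply (for example, a thin wrapper around whatever Gremlin client you use), and capping the client-side concurrency at 2x vCPUs is simply one reading of the thread count mentioned above, not a documented recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def max_concurrent_queries(vcpus: int) -> int:
    """Number of query execution threads per instance, per the answer: 2 x vCPUs."""
    return 2 * vcpus


def parallel_count(queries: Iterable[str],
                   run_query: Callable[[str], int],
                   vcpus: int) -> int:
    """Run each count query concurrently and sum the partial counts.

    run_query is a hypothetical, caller-supplied function that submits one
    Gremlin query string and returns its integer count.
    """
    # Assumption: never submit more queries at once than the instance has
    # execution threads, so no query waits in the server-side queue.
    workers = max_concurrent_queries(vcpus)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_query, queries))
```

Each partial query touches only its own partition, so each execution thread has less intermediate state to hold. If other workloads share the instance, a lower worker count is probably safer.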

[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html

To add to Taylor's comments: if a key need for your application is essentially to partition the data using a key (in your case the ID prefix), it is worth considering making that prefix a first-class property and doing an exact-match lookup on it. That way the Neptune index can support the lookup efficiently. — Kelvin Lawrence, Aug 11 '22 at 14:33
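To make that suggestion concrete, here is a hedged Python sketch that builds the per-partition count query as a string, replacing `hasId(startingWith(...))` with an exact match on a property. The property name `idPrefix` is purely hypothetical; it assumes you have already written the prefix onto each vertex as its own property when loading the data.

```python
def partition_count_query(label: str, edge_label: str, prefix: str,
                          key: str = "idPrefix") -> str:
    """Build a Gremlin count query that uses an exact-match property lookup.

    key is a hypothetical property holding the ID prefix of each vertex.
    """
    for value in (label, edge_label, prefix, key):
        # Keep the sketch simple: refuse values that would need escaping.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported character in {value!r}")
    # has(label, key, value) filters on the vertex label and an exact property value.
    return (f"g.V().has('{label}', '{key}', '{prefix}')."
            f"where(outE('{edge_label}')).count()")
```

For example, `partition_count_query('Label', 'EdgeType', 'prefix')` yields a query with the same shape as the one in the question, but with an equality filter that an index can serve instead of a text predicate.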
