
I read "Query to retrieve all paths traversable from a given vertex", which describes how to find all paths from a node using Gremlin. I'm trying to understand what performance I can reasonably expect from this query on a real-life dataset in AWS Neptune.

I've limited which edge labels are queried by passing labels to bothE. However, performance degrades rapidly beyond a depth of about 5 (I believe this depends on the branching factor of the graph).

I'm mainly trying to understand what reasonable expectations are for Neptune. The property graph has roughly 750M nodes, 1.5B edges, and 1.5B properties, and is fairly interconnected. The instance type is a db.r5.4xlarge.
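For a rough sense of scale, the sketch below turns the node and edge counts above into an average degree and an upper bound on the number of simple paths at each depth. It is only a back-of-the-envelope estimate: it assumes every vertex has the same degree and ignores the edge-label filter, and the function names are made up for illustration. Real graphs usually have a skewed degree distribution, so a few high-degree vertices can push the actual count far above or below this figure.

```python
def average_degree(num_vertices, num_edges):
    """Average number of incident edges per vertex when edges are
    followed in both directions (each edge touches two vertices)."""
    return 2 * num_edges / num_vertices


def max_simple_paths(degree, depth):
    """Upper bound on simple paths of exactly `depth` hops from one vertex,
    assuming a uniform degree: `degree` choices for the first hop, then at
    most `degree - 1` for each later hop (the edge just used is excluded)."""
    if depth == 0:
        return 1
    return degree * (degree - 1) ** (depth - 1)


if __name__ == "__main__":
    # Figures taken from the question: ~750M nodes, ~1.5B edges.
    avg = average_degree(750_000_000, 1_500_000_000)
    print(f"average degree: {avg}")

    # Compare the average case with hypothetical denser neighbourhoods.
    for degree in (avg, 20, 100):
        for depth in (3, 5, 7):
            bound = max_simple_paths(degree, depth)
            print(f"degree={degree:>5} depth={depth} paths<={bound:,.0f}")
```

The bound grows geometrically with depth, so a start vertex in a dense neighbourhood can produce orders of magnitude more paths than the average degree suggests.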

Thanks for any help!

Example Query: g.V('mynode').repeat(bothE('lbl1', 'lbl2', 'lbl3').otherV().simplePath()).until(__.not(bothE('lbl1', 'lbl2', 'lbl3').simplePath()).or().loops().is(eq(5))).path().count()
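To make the shape of the work concrete, here is a small in-memory sketch of how I read the traversal above. It is not Neptune's or TinkerPop's implementation, just a plain-Python approximation: it ignores edge labels, and `build_adjacency` and `count_emitted_paths` are hypothetical helper names. A path is extended only over an unused edge to an unvisited vertex (the `simplePath()` in the repeat body), and it is emitted once the current vertex has no unused incident edge or the depth limit is reached (the `until` condition).

```python
def build_adjacency(edges):
    """Map each vertex to a list of (edge_id, other_vertex) pairs,
    treating every edge as traversable in both directions."""
    adjacency = {}
    for edge_id, (a, b) in enumerate(edges):
        adjacency.setdefault(a, []).append((edge_id, b))
        adjacency.setdefault(b, []).append((edge_id, a))
    return adjacency


def count_emitted_paths(adjacency, start, max_depth=5):
    """Count the paths the sketch emits from `start`."""
    emitted = 0
    # Each entry: (current vertex, visited vertices, used edges, hops so far).
    stack = [(start, frozenset([start]), frozenset(), 0)]
    while stack:
        vertex, seen, used, depth = stack.pop()
        for edge_id, other in adjacency.get(vertex, []):
            # Repeat body: drop any step that reuses an edge or a vertex.
            if edge_id in used or other in seen:
                continue
            new_used = used | {edge_id}
            new_depth = depth + 1
            # Until condition: no unused edge left here, or depth limit hit.
            has_unused_edge = any(
                e not in new_used for e, _ in adjacency.get(other, [])
            )
            if not has_unused_edge or new_depth == max_depth:
                emitted += 1
            else:
                stack.append((other, seen | {other}, new_used, new_depth))
    return emitted


if __name__ == "__main__":
    tree = build_adjacency([("a", "b"), ("a", "c"), ("b", "d")])
    print(count_emitted_paths(tree, "a"))
```

In this sketch, a path that ends on a vertex whose only unused edges lead back to already-visited vertices is neither emitted nor extended, so it does not contribute to the count even though it was explored. Whether the real query behaves the same way is worth checking against the profile output.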

Example Profile is at https://pastebin.com/M6r4Xr54.

I've been profiling queries with the profile endpoint. It's been difficult for me to understand why performance is degrading, other than the sheer volume of data being returned, but the volume isn't that great at depth 5.

wless1
  • you mentioned "volume of data" possibly being a factor – is it possible to measure how much time is spent evaluating the query vs. transmitting the results? I'm just trying to learn a bit from your situation. – Kaan Mar 29 '22 at 18:56
  • Just a comment to let you know I somehow missed your post. Taking a look now. Will report back if I see anything unusual. In general query performance is influenced by the amount of data that needs to be fetched and inspected. On the surface your query does not seem to be touching too much data. What instance size are you running on? – Kelvin Lawrence Apr 12 '22 at 13:16
