Gremlin correlated queries kill performance

Question

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.

Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...

g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")

That query is 10x more expensive than simply getting the edges using:

g.V().outE("someLabel").has("someProperty","someValue")

So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.

I'm working from the assumption that in Gremlin, we "do it in one trip" and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
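For reference, the "two discrete queries, join in the API" option can be sketched roughly like this. It is only an illustration: the shape of the edge records (an "outV" key holding the source vertex ID) and the sample data are assumptions, not something taken from a specific driver or from the graph in question.

```python
from collections import defaultdict


def left_join_edges(vertex_ids, edges, blank=None):
    """Pair every vertex ID with its matching edges, or with a blank placeholder."""
    # Index the edges by their source vertex so each lookup is O(1).
    edges_by_source = defaultdict(list)
    for edge in edges:
        edges_by_source[edge["outV"]].append(edge)

    rows = []
    for vid in vertex_ids:
        matches = edges_by_source.get(vid)
        if matches:
            # One row per matching edge, like select("v", "e") would return.
            rows.extend({"v": vid, "e": edge} for edge in matches)
        else:
            # No edge: keep the vertex and fill in the placeholder.
            rows.append({"v": vid, "e": blank})
    return rows


# Hypothetical results of the two cheap queries (vertex IDs, then edges).
vertex_ids = ["a", "b", "c"]
edges = [
    {"id": "e1", "outV": "a", "inV": "b"},
    {"id": "e2", "outV": "a", "inV": "c"},
]
rows = left_join_edges(vertex_ids, edges)
```

Here "a" yields two rows (one per edge) while "b" and "c" each yield one row with a blank edge, which is the jagged result the coalesce() query was meant to produce.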

Beans · Accepted Answer · 2019-10-24T21:45:25.937

OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...

g.inject(true).
union(
  __.V().not(outE("someLabel")).constant().as("ridiculous"),
  __.V().outE("someLabel").as("ridiculous")
).
select("ridiculous")

In essence, I have to write the query twice: once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present / not present checks, I'm going to need 2^n copies of the query, each with its own combination of checks, to get the best performance. Unfortunately, taking that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.
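To make the 2^n growth concrete, here is a small sketch that enumerates the present / not present combinations each union branch would have to cover. The label names are hypothetical placeholders.

```python
from itertools import product


def branch_combinations(checks):
    """Yield one dict per union branch: each check is either present or absent."""
    for flags in product((True, False), repeat=len(checks)):
        yield dict(zip(checks, flags))


checks = ["labelA", "labelB", "labelC"]  # hypothetical edge labels
branches = list(branch_combinations(checks))
print(len(branches))  # 2 ** 3 branches for three checks
```

With three checks that is already eight hand-written branches, and every additional check doubles the count.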

Kfir Dadosh · Answer 2 · 2019-10-25T11:43:27.620

Your original query returned vertex-edge pairs, whereas your answer returns only edges.

You could just run g.E().hasLabel("someLabel") to get the same result.

A faster alternative to your original query might be:

g.E().hasLabel("someLabel").as("e").outV().as("v").select("v","e")

Or

g.V().as("v").outE("someLabel").as("e").select("v","e")

edited Oct 25 '19 at 11:43

answered Oct 24 '19 at 22:49

"Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank." Dropping the vertex part would be awesome, if I didn't need the vertices. That snippet is part of a larger, very complicated query. – Beans Oct 25 '19 at 08:37
So just reverse it. I will add it to my answer. – Kfir Dadosh Oct 25 '19 at 11:39
First of all, it returns the vertex-edge pair like you requested in your original question, while filtering out the vertices with no edge, as you did by returning an empty constant(). If you care about the other vertices, there are a number of ways you can do it, but I won't spend any more time answering you. Good Luck! – Kfir Dadosh Oct 25 '19 at 15:46

score 0 · Answer 3 · answered Oct 25 '19 at 11:47

"If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem"

Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.

I'm also not sure I have an answer that you might find useful. I wouldn't expect a 10x difference in performance between those two traversals, but I would expect the first to perform worse on most graph databases. Basically, using named steps with as() makes the traversal track path information, which increases cost. When optimizing Gremlin, one of my earliest steps is to look for ways to factor out anything that might do that.

This question seems related to your other question on Jagged Result Array, but I'm having trouble carrying the context from one question into the other to understand how to expound further.

stephen mallette

Thanks for the answer, the questions are related; all part of one performance optimization exercise. I'm breaking things down into parts and looking for performance killers, this is one. The more I dig, the more I find that it's just faster to have Gremlin go through one route and then do subsequent queries to get the additional data. – Beans Oct 25 '19 at 12:31
That could be the case. There are times where two separate queries might outperform one. I wouldn't have expected this one to be the case though, but as I've stated elsewhere I don't know CosmosDB well. If you do two traversals, you might consider passing the vertex list to `g.V(verticesFromFirstQuery)` rather than doing `g.V().hasLabel()` again. Most graphs would perform better under the former than the latter. – stephen mallette Oct 25 '19 at 12:42
Completely agree with you about sending vertex IDs from first query. I've landed up at that solution in my other question about scalability. – Beans Oct 25 '19 at 12:59
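A rough sketch of preparing that second query from the first query's vertex IDs follows. It assumes the driver in use accepts a query string plus a map of parameter bindings; the function name, binding names and sample IDs are hypothetical, and nothing is actually submitted here.

```python
def follow_up_request(vertex_ids, label, prop, value):
    """Build the second query and its bindings from the first query's vertex IDs."""
    # Drop duplicate IDs while keeping the order of the first result set.
    ids = list(dict.fromkeys(vertex_ids))
    # Start from the known vertices instead of scanning with g.V() again.
    query = "g.V(ids).outE(label).has(prop, value)"
    bindings = {"ids": ids, "label": label, "prop": prop, "value": value}
    return query, bindings


# Hypothetical IDs returned by the first query.
query, bindings = follow_up_request(
    ["v1", "v2", "v1"], "someLabel", "someProperty", "someValue"
)
```

Passing the IDs as a binding rather than concatenating them into the query text avoids quoting problems and keeps the query string identical between calls.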
