I'm experiencing issues querying a large graph involving repeat steps that aim at making "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:

John--livesIn-->Paris

Paris--isIn-->France

What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.

The query that I use is the following, where I make no more than 2 hops:

g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()

This works as expected until I apply it to a graph of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of errors (over the REST API) with no clear pattern:

  • Error: Error encountered evaluating script
  • Error: 504 Gateway Time-out
  • Error: Java heap space
  • Error

I suspect that I am doing something wrong in my query. For example, when I set the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly, since the traversal should not go into the repeat loop. However, this triggers the same issue.

Many thanks for your help!

Olivier

Olivier D.
1 Answer

So it looks like you have a few things going on here. First, let me take a shot at answering your question; then let's look at why your traversal may be taking a long time to complete.

Based on your description of wanting to return John and France, the following traversal should get your data:

g.V().has('name','John').as('person')
.out('livesIn')
.out('isIn').as('country').select('person', 'country')

That will select all countries that a person named 'John' lives in.

Now to understand why your traversal was taking a long time. First, you are using several steps that are very memory- and resource-intensive, such as bothE and bothV. Each of these steps navigates the relationship in both directions. Since you know that the edge you are trying to traverse points out in both cases, it is much quicker and less resource-intensive to use out(), which traverses the specified edge label (if supplied) and lands you on the adjacent vertex. Additionally, simplePath is another resource-intensive (specifically memory-intensive) step: it must track the path of every traverser so that it can drop a traverser as soon as its path contains a repeated object. This, combined with the extra traversers created by the use of loops, bothE, and bothV, is likely the cause of the slow query. I suspect that the query above will perform significantly better.
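
To make the difference in traverser counts concrete, here is a minimal sketch meant for the Gremlin Console (which pre-imports TinkerGraph). It builds a throwaway in-memory graph holding only the John/Paris/France example; the vertex labels person, city, and country are made up for illustration. It then counts how many traversers each variant produces after a single hop from John.

```groovy
// Gremlin Console sketch; the toy graph and its vertex labels are assumptions.
graph = TinkerGraph.open()
g = graph.traversal()
g.addV('person').property('name', 'John').as('j').
  addV('city').property('name', 'Paris').as('p').
  addV('country').property('name', 'France').as('f').
  addE('livesIn').from('j').to('p').
  addE('isIn').from('p').to('f').iterate()

// bothE().bothV() lands on BOTH ends of every incident edge, so John himself comes back.
viaBothV = g.V().has('name', 'John').bothE().bothV().count().next()    // 2
// bothE().otherV() lands only on the far end of each edge.
viaOtherV = g.V().has('name', 'John').bothE().otherV().count().next()  // 1
// out('livesIn') follows a single edge label in a single direction.
viaOut = g.V().has('name', 'John').out('livesIn').count().next()       // 1
```

On a three-vertex graph the difference is one traverser, but inside a repeat() every surplus traverser is expanded again on each iteration, so the gap grows with the number of hops and the degree of the vertices.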

If you would like to see exactly what your query is doing, I would suggest taking a look at the explain and profile steps, which provide detailed information on your query's performance.
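
As a rough illustration of those two steps, the sketch below runs them against the directed query from this answer in the Gremlin Console. The in-memory toy graph and its vertex labels are assumptions made only so the snippet is self-contained; against a real graph the setup lines would be replaced by your own g.

```groovy
// Gremlin Console sketch; the toy graph and its vertex labels are assumptions.
graph = TinkerGraph.open()
g = graph.traversal()
g.addV('person').property('name', 'John').as('j').
  addV('city').property('name', 'Paris').as('p').
  addV('country').property('name', 'France').as('f').
  addE('livesIn').from('j').to('p').
  addE('isIn').from('p').to('f').iterate()

// explain() does not run the query; it shows how the traversal strategies rewrite it.
plan = g.V().has('name', 'John').out('livesIn').out('isIn').explain()
println plan

// profile() runs the query and reports per-step traverser counts and timings.
metrics = g.V().has('name', 'John').out('livesIn').out('isIn').profile().next()
println metrics
```

Profiling the slow query on a small subgraph first is probably the safer option, since profile() has to execute the traversal in full before it can report anything.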

bechbd
  • Hi bechbd, and thanks a lot for this explanation. Your suggestions make a lot of sense. However, I have specific constraints: 1) I need to make several "hops" across the graph, hence the loops/repeat step; 2) I need to go in all directions (I don't really know the kind of edges I will be traversing), hence bothE().bothV().simplePath(). Any more suggestions for optimizing my query? – Olivier D. Mar 13 '18 at 08:11
  • Look at using repeat(both()).times(2), which will go out in both directions 2 times; both() traverses to the adjacent vertex, similar to what bothE().otherV() would do. Your query contains bothE().bothV(), which returns you not only to the vertex on the other side but also to the original vertex. What is the need for the simplePath portion of the query? As mentioned, this is a resource-intensive operation, and from what you stated I see no need for it. – bechbd Mar 13 '18 at 23:26
  • Thank you again. I had looked at that initially actually. The issue with using out() or both() is that we can't keep track of the edges that were traversed. The explain() step does not deliver this information either in the output. This is why I want to use bothE().bothV() or inE().outV() alternatively. – Olivier D. Mar 14 '18 at 15:07
  • Are you trying to get the path() that was traversed or only the resulting vertices? – bechbd Mar 14 '18 at 20:35
  • Sorry for the delay... I was busy with this implementation! Indeed, I need to return the full path that was traversed, and not only the vertices. This is a prerequisite for me, but it seems very (very) resource intensive. Now, the question is: is there some alternative way to do it in terms of querying? – Olivier D. Mar 26 '18 at 15:08
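
For what it's worth, one possible way to keep the traversed edges in the result while bounding the work is sketched below. It is not taken from the thread. It assumes that two hops are enough, swaps bothV() for otherV(), and adds times(2), because in the original query emit(...) appears to control only which traversers are returned, while the repeat() itself has no times() or until() and so keeps expanding until simplePath() has exhausted every non-repeating path. The toy graph and its vertex labels are illustrative only.

```groovy
// Gremlin Console sketch; the toy graph and its vertex labels are assumptions.
graph = TinkerGraph.open()
g = graph.traversal()
g.addV('person').property('name', 'John').as('j').
  addV('city').property('name', 'Paris').as('p').
  addV('country').property('name', 'France').as('f').
  addE('livesIn').from('j').to('p').
  addE('isIn').from('p').to('f').iterate()

// At most two hops in any direction, emitting after every hop.
// path() keeps vertices AND edges; by(...) shows a vertex by its name
// and falls back to the label for edges, which have no 'name' property.
paths = g.V().has('name', 'John').
  repeat(__.bothE().otherV().simplePath()).times(2).emit().
  path().by(__.coalesce(__.values('name'), __.label())).toList()

// Two paths on the toy graph: John, livesIn, Paris
// and John, livesIn, Paris, isIn, France.
paths.each { println it }
```

simplePath() is kept here only to stop a traverser from walking straight back over the edge it just crossed. On a densely connected graph the number of two-hop paths can still be large, so restricting edge labels (for example bothE('livesIn', 'isIn')) may be needed as well.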