
These queries are all logically equivalent and return the same 6 results (except the last, which returns only 5), but their performance varies widely, from 31 ms to 45 seconds. I'm using Neo4j 2.0.2. I have an index ON :SEGMENT(propertyId), but the lookup of (n) is not why the query is slow.

match (n {productId:6122})<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
43879 ms

match (n:SEGMENT {productId:6122})<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
44926 ms

start n=node(111426) match (n)<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
31 ms

match (n {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path;
[...]
6 rows
694 ms

match (n:SEGMENT {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path;
[...]
6 rows
161 ms

match (n:SEGMENT)<-[:PARENT_OF*]-(p:SEGMENT) where n.productId=6122 return n,p;
[...]
5 rows
45332 ms

Added PROFILE output:

PROFILE match (n:SEGMENT {productId:6122})<-[:PARENT_OF*]-(p:SEGMENT) return n,p;
`ColumnFilter(symKeys=["n", "p", "  UNNAMED34"], returnItemNames=["n", "p"], _rows=5, _db_hits=0)
Filter(pred="(hasLabel(n:SEGMENT(0)) AND Property(n,productId(9)) == Literal(6122))", _rows=5, _db_hits=1895169)
  TraversalMatcher(start={"label": "SEGMENT", "producer": "NodeByLabel", "identifiers": ["p"]}, trail="(p)-[:PARENT_OF*1..]->(n)", _rows=1895169, _db_hits=1895169)`

PROFILE match (n {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path; 
`ColumnFilter(symKeys=["n", "p", "  UNNAMED41", "path"], returnItemNames=["path"], _rows=6, _db_hits=0)
ExtractPath(name="path", patterns=["ParsedVarLengthRelation(  UNNAMED41,Map(),ParsedEntity(n,n,Map(),List()),ParsedEntity(p,p,Map(),List()),List(PARENT_OF),INCOMING,false,None,None,None)"], _rows=6, _db_hits=0)
  PatternMatcher(g="(n)-['  UNNAMED41']-(p)", _rows=6, _db_hits=0)
    Filter(pred="Property(n,productId(9)) == Literal(6122)", _rows=1, _db_hits=48531)
      AllNodes(identifier="n", _db_hits=48531, _rows=48531, identifiers=["n"], producer="AllNodes")`
Michael Hunger
Sean Timm
  • Try using PROFILE keyword to compare the execution plans. An example here: http://stackoverflow.com/questions/17760627/understanding-neo4j-cypher-profile-keyword-and-execution-plan. – fbiville Apr 30 '14 at 14:29
  • The third one is fast because it uses the internal id. Now, this may just be a typo, but you mention an index on 'propertyId', while the queries are looking for 'productId'. – Dan G Apr 30 '14 at 14:39
  • Yes, a typo. Good catch. With the index, 'match (n:SEGMENT {productId:6122})<-[:PARENT_OF*]-(p) return n,p; 6 rows 40 ms' – Sean Timm Apr 30 '14 at 14:50
  • The lookup using the internal id takes about 10 ms. Same as when I had the proper index on :SEGMENT(productId). Without an index, the lookup was 750ms. – Sean Timm Apr 30 '14 at 15:26

1 Answer


The fastest query is the internal ID lookup, which is not surprising. The ID value itself (avoid it as an external identifier) is tightly coupled to the stored data structure. It is roughly equivalent to telling Cypher where the node is in the node store file. (*)

For the next two fastest ones, I might be totally mistaken, but I think they are faster because you match a path only, although I'm not sure exactly how this influences the query behaviour. The small delta between the two can be explained by the fact that one query uses the schema index under the hood, while the other does not (as the label isn't specified in the slower of the two).

For the three slowest ones, it might be that the time spent locating the start point is negligible compared to the time spent following your PARENT_OF relationships to their full depth. You may end up traversing long paths, but I'm not sure.

(*) Still, I don't understand how looking up the start node by ID would alone explain such a difference from the two similar slowest queries (they also don't match by path...)

fbiville
  • Thanks for the tip on PROFILE. It looks like the fast queries evaluate ~48K paths. The slow ones evaluate ~1.8M paths. – Sean Timm Apr 30 '14 at 15:16
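The PROFILE output and the comments point to two things: the index was declared on the wrong property, and the slow plan starts from every SEGMENT node `p` and expands outward (about 1.9M rows) before filtering on `n`. A minimal sketch of the fix, assuming the index is meant to be on `productId` as confirmed in the comments. The `USING INDEX` hint and the `WITH n` split are illustrative additions, not something the thread shows to be required, and whether the 2.0.2 planner needs them is untested here:

```cypher
// Create the schema index on the property the queries actually filter on
// (the question mentions 'propertyId', which the comments confirm was a typo).
CREATE INDEX ON :SEGMENT(productId);

// Resolve the start node first, then expand from it.
// USING INDEX is an optional hint; it needs the predicate in a WHERE clause.
PROFILE
MATCH (n:SEGMENT)
USING INDEX n:SEGMENT(productId)
WHERE n.productId = 6122
WITH n
MATCH (n)<-[:PARENT_OF*]-(p)
RETURN n, p;
```

In the PROFILE result, the thing to look for is a plan that starts from `n` rather than from a label scan on `p`, with `_db_hits` far below the 1,895,169 seen in the slow plan.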