I'm trying to accomplish a pretty common task. I have a substantial dataset in a Neo4j database and, from a RESTful web service, I want to return the data in chunks of 25 nodes. My model is quite simple:

(:Tenant {Hash:''})-[:owns]->(:Asset {Hash:'', Name:''})

I have unique constraints on the Hash properties on both labels.

If I wanted to obtain the 101st data page, my Cypher query would look like this:

MATCH (:Tenant {Hash:'foo'})-[:owns]->(a:Asset)
RETURN a
ORDER BY a.Hash
SKIP 2500
LIMIT 25

My dataset consists of a single tenant with ~75K assets. The above query takes ~30(!) seconds to complete. I also notice that the further I advance into the data (i.e. the higher the SKIP), the longer the query takes to return.

I quickly figured out that the culprit of my performance issues is the ORDER BY a.Hash. When I remove it, the query returns sub-second results. This actually came as quite a surprise, as I'd expect the index itself to be ordered.

Obviously, in order to implement sensible pagination, I must have a consistent sort order.

  • Any tips on making this query perform?
  • Alternative suggestions for paging? I can see adding dedicated page nodes, but that will become difficult to maintain.
  • What is the default sort order anyway, and is it consistent?
Geoffrey Braaf
  • Index order won't matter since you don't retrieve assets from index but via relationship, so the relevant order would be relationship order in the node, i.e. `[:owns]` by `(:Tenant)`. Does query speed improve on second run? What if you order by id, i.e. `ORDER BY ID(a)` or order by the relationship, i.e. bind `[o:owns]` and do `ORDER BY ID(o)`? – jjaderberg Mar 21 '14 at 13:00
  • Order by relationship.id might be the way to go. Agree with jjaderberg that this would better exploit the way the data is actually indexed. Also, you're presently ordering by "hash" which while it should be a consistent ordering, doesn't sound like it's any more meaningful than ordering by relationship.id. – FrobberOfBits Mar 21 '14 at 13:10
  • @jjaderberg yes, performance is slightly better after the first run, but not substantially. I did actually attempt to sort by `id(a)`, but that resulted in far worse performance, and with higher (say 2500) `SKIP` counts it even produced errors. – Geoffrey Braaf Mar 21 '14 at 13:19
  • @FrobberOfBits I totally agree that sorting on the hash is useless from a user's point of view. I do it anyway, because I want pagination to be consistent. Given that I'm unsure the default sort order is consistent, I need to have some sort of guarantee, right? Given a stable set of nodes, I want the same nodes to return on the same data page every time. – Geoffrey Braaf Mar 21 '14 at 13:23
  • Geoffrey, what is the use-case for 2500 pages? No user looks that far. How about passing in the last hash that you got, and instead of the skip use `WHERE a.Hash > {Hash} RETURN a LIMIT 25`. That should be much faster (see the sketch after these comments). – Michael Hunger Mar 23 '14 at 19:25
  • @Michael first off, this is an example; actual page sizes are going to be a lot bigger (say 500-1000). Secondly, I'm not accessing page 2500, but (2500/25+1 =) page 101. Context: this is a means for clients of my API to download their data, which is encrypted in my database and decrypted only transiently, when someone makes the request. I cannot prepare the download file beforehand, as only they possess part of the keys required. So what I'm looking for is *not* a sensible way to page from a user's point of view, but a consistent order in which I can retrieve the data. Make sense? – Geoffrey Braaf Mar 24 '14 at 07:16
  • Thanks @GeoffreyBraaf for reporting this, see my answer about the final resolution. https://github.com/neo4j/neo4j/pull/2230 – Michael Hunger Apr 04 '14 at 00:55
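
For reference, a minimal sketch of Michael Hunger's keyset-style suggestion from the comments. The {lastHash} parameter name is my own: the client would pass the Hash of the last asset on the previous page (or a value below any hash, such as an empty string, for the first page):

MATCH (:Tenant {Hash:'foo'})-[:owns]->(a:Asset)
WHERE a.Hash > {lastHash}
RETURN a
ORDER BY a.Hash
LIMIT 25

Because each page is bounded by a WHERE predicate instead of skipping rows, the amount of discarded work should no longer grow with page depth.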

2 Answers


Hey @GeoffreyBraaf, I found some time this week to look at your issue. You are right: there was an implementation issue that made this unnecessarily slow.

I used Timmy's suggestion to implement a Java version, which finished in 30ms, while the Cypher version took 100 seconds. Implementing a top-n select in Cypher improved it massively, by a factor of about 600, so Cypher now takes roughly 150ms for that query.

See: https://gist.github.com/jexp/9954509
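
To illustrate the general idea behind a top-n select (a hypothetical sketch of the technique, not the actual Cypher patch; see the pull request below for that): rather than fully sorting all ~75K rows and then applying SKIP/LIMIT, you keep only the skip + limit smallest sort keys in a bounded max-heap:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of top-n selection over sort keys (here: hashes).
public class TopNSelect {

    public static List<String> page(Iterable<String> hashes, int skip, int limit) {
        int n = skip + limit;
        // Max-heap holding the n smallest keys seen so far; its head is their maximum.
        PriorityQueue<String> heap = new PriorityQueue<>(n, Collections.<String>reverseOrder());
        for (String hash : hashes) {
            if (heap.size() < n) {
                heap.offer(hash);
            } else if (hash.compareTo(heap.peek()) < 0) {
                heap.poll();   // evict the current maximum of the n smallest
                heap.offer(hash);
            }
        }
        // Sort the surviving n keys ascending and drop the skipped prefix.
        List<String> sorted = new ArrayList<>(heap);
        Collections.sort(sorted);
        return sorted.subList(Math.min(skip, sorted.size()), sorted.size());
    }
}

This does O(total · log(skip + limit)) work instead of O(total · log(total)) and keeps memory bounded by the page offset, which is roughly where such speedups come from.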

The work is already merged into 2.0-maint and will be released as part of 2.0.2.

See: https://github.com/neo4j/neo4j/pull/2230

Michael Hunger

Although I'm actually not a big fan of my own solution, you get surprisingly good results when you code the retrieval and paging of nodes yourself, even with relatively large datasets. If you're using Java, could you give the following a spin?

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.helpers.collection.IteratorUtil;
import org.neo4j.tooling.GlobalGraphOperations;

try (Transaction tx = graphDb.beginTx()) {
    // Load all Asset nodes, sort them in memory, then slice out one page.
    List<Node> nodes = IteratorUtil.asList(GlobalGraphOperations.at(graphDb)
        .getAllNodesWithLabel(DynamicLabel.label("Asset")));
    Collections.sort(nodes, new NodeComparator());
    final List<Node> result = nodes.subList(4375, 4400);
    tx.success();
}

static class NodeComparator implements Comparator<Node> {
    @Override
    public int compare(final Node o1, final Node o2) {
        // The property key must match the model, which uses "Hash" (capital H).
        return o1.getProperty("Hash").toString().compareTo(o2.getProperty("Hash").toString());
    }
}
tstorms
  • So are you suggesting I load *all* nodes and then do paging and sorting in memory? If so, I really don't think that will scale. I am using .NET by the way, and I connect through the REST endpoint. – Geoffrey Braaf Mar 21 '14 at 14:08
  • You could implement this as a server plugin, so it doesn't matter which programming language you use. I know this solution seems weird, but you'd be surprised of the results. – tstorms Mar 21 '14 at 14:10
  • Not sure I follow. This is something that would run on my Neo4j server? – Geoffrey Braaf Mar 21 '14 at 15:07
  • Ideally yes. See http://docs.neo4j.org/chunked/stable/server-plugins.html for more information on plugins. You can easily call a custom plugin with the REST API. Your plugin would only need to accept a page number, do the retrieval like above, and return the results (a rough sketch follows after these comments). – tstorms Mar 21 '14 at 15:09
  • Even though I haven't been able to try this out, I did receive a (very) similar suggestion from Michael Hunger via Twitter. Apparently server side extensions are the way to go in a scenario as this where paging a large dataset is required. Not sure about the exact implementation details, but marking this as the answer. – Geoffrey Braaf Mar 27 '14 at 09:37
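
For reference, a rough sketch of what such a server plugin might look like against the Neo4j 2.0 ServerPlugin API; the class name, extension point name, and page parameter are all made up for illustration:

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.helpers.collection.IteratorUtil;
import org.neo4j.server.plugins.Description;
import org.neo4j.server.plugins.Name;
import org.neo4j.server.plugins.Parameter;
import org.neo4j.server.plugins.PluginTarget;
import org.neo4j.server.plugins.ServerPlugin;
import org.neo4j.server.plugins.Source;
import org.neo4j.tooling.GlobalGraphOperations;

// Hypothetical plugin: accepts a page number over REST and returns one page
// of Asset nodes sorted by their Hash property.
@Description("Paged access to Asset nodes, ordered by Hash")
public class AssetPagePlugin extends ServerPlugin {

    private static final int PAGE_SIZE = 25;

    @Name("asset_page")
    @Description("Return one page of Asset nodes")
    @PluginTarget(GraphDatabaseService.class)
    public Iterable<Node> assetPage(@Source GraphDatabaseService graphDb,
            @Parameter(name = "page") Integer page) {
        try (Transaction tx = graphDb.beginTx()) {
            List<Node> nodes = IteratorUtil.asList(GlobalGraphOperations.at(graphDb)
                .getAllNodesWithLabel(DynamicLabel.label("Asset")));
            Collections.sort(nodes, new Comparator<Node>() {
                @Override
                public int compare(final Node o1, final Node o2) {
                    return o1.getProperty("Hash").toString()
                        .compareTo(o2.getProperty("Hash").toString());
                }
            });
            int from = Math.min(page * PAGE_SIZE, nodes.size());
            int to = Math.min(from + PAGE_SIZE, nodes.size());
            tx.success();
            return nodes.subList(from, to);
        }
    }
}

Deployed as a JAR in the server's plugins directory, this would be callable through the REST API as described in the linked documentation.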