0

I have a Neo4J instance running with the Neo4J Spatial plugin. In it, I have a graph with around 3.5 k nodes each with the same label, we'll call Basket. Each Basket relates to a physical location in the same city, and the density of these baskets is very variable. I have calculated walking times between each Basket and all of its neighbours within 600m, and stored these as non-spatial (directed) relationships between nodes. Thus, some Baskets exist as what seems to be part of a big cluster, and others exist almost on their own, with only one or almost no relationships to other Baskets.

My users have a problem: they wish to begin in one place, and end in another place, visiting an arbitrary, user-defined, number of Baskets along the way. My program aims to provide a few route options for the user (as a sequence of nodes - I'll sort the actual how-to-walk-there part later), calculating the n-th number of shortest paths.

I've written a cypher query to do this, below.

start a = node(5955), b=node(6497) 
WITH a,b 
    MATCH p=((a)-[r:IS_WALKABLE_TO*4..5]->(b)) 
RETURN p

N.B. - nodes 5955 and 6497 are two nodes I picked about 2 miles apart, in this instance I decided to opt for between 4 and 5 baskets along the way.

However, I keep running into an out of memory exception, and so would like advice on how to reduce the memory demand for this problem to make it perform on an affordable server in an acceptable time of 1 to 6 seconds.

My understanding is that Neo4j would not perform a Cartesian Product to find the solution, but kind of "pick each node and sniff around from each one until it finds a suitable-sized connection" (please, forgive my phrasing!), so I'm confused about the heap memory error.

My thoughts for improving the program are to:

  1. Somehow restrict the path-finding part of the query to nodes within a bounding box, determined by the placing of the start and end node (i.e., add 500 m in each direction, then limit the query to these nodes). However, I can't find any documentation on how to do this - is it possible without having to create another spatial layer for each query?

  2. Re-write the query in a way which doesn't create a memory error - is this doable easily?

  3. Stop using Neo4J for this entirely and write an algorithm to do it manually using an alternative language. If so, what language would you recommend? C? C++ / C#? Or could I stick with Python / Ruby / Java / Go? (or, I was even thinking I might be able to do it in PHP quite effectively but I'm not sure if that was a moment of madness).

Any help and advice about how to tackle this much appreciated!

Dom Weldon
  • 1,728
  • 1
  • 12
  • 24
  • You forgot the colon in front of `:IS_WALKABLE_TO` so you consider _ALL_ relationships between your nodes. – Michael Hunger Dec 27 '15 at 15:07
  • Which Neo4j version do you use? Would you be able to share your database? – Michael Hunger Dec 27 '15 at 15:08
  • I presume you get a few hundred million possible paths here, that's why it explodes. Can you run your query prefixed with EXPLAIN to see the potential cardinalities ? – Michael Hunger Dec 27 '15 at 15:09
  • Thanks very much for this! I'm using version 2.3.1. I'm afraid I can't release the database at the moment, but I think you're right. Because of the clustering, the number of possible paths is easily in the order 10^6 to 10^7, which definitely isn't practical for a small server demanding reasonable execution time. – Dom Weldon Dec 28 '15 at 20:01

2 Answers2

1

You might be better off refactoring this Cypher query into Java code into an unmanaged extension. Your java code might then use either Traversal API or GraphAlgoFactory.pathsWithLength()

Stefan Armbruster
  • 39,465
  • 6
  • 87
  • 97
1

I think due to the densely connected shape of your graph you easily end up with hundreds of millions of possible path due to duplicate intermediate nodes.

You should add a LIMIT 100 to your query then it stops searching for paths.

One other idea is to rewrite your query to first find distinct starting points around a (and potentially b).

start a = node(5955), b=node(6497) 
MATCH (a)-[:IS_WALKABLE_TO]->(a1)-[:IS_WALKABLE_TO]->(a2)
WITH a, b, a2, collect(a1) as first
MATCH p = shortestPath((a2)-[:IS_WALKABLE_TO*..2]->(b)) 
RETURN count(*)

// or
UNWIND first as a1
RETURN [a,a1] + nodes(p) as path
Michael Hunger
  • 41,339
  • 3
  • 57
  • 80
  • This query works (the reason for the verbose notation is that it doesn't seem to work when I use `((a)-[:IS_WALKABLE_TO*3]->(b))`. The query above returned sensible results in 6 seconds, however, when I add another node to the path length (i.e. i4 and r5) I get an out of memory exception (not surprising given the puny server, but I couldn't justify a big server in prod so not sure scaling it up is an option). – Dom Weldon Dec 28 '15 at 20:05
  • My current thinking is to try the method suggested by Stefan Armbruster and create an extension, and to try another method completely. Since my routing engine (OSRM) is quite efficient at creating distance matrices, I might split the whole of London up into numerous overlapping tiles with pre-computed distance matrices of sensible values (i.e., actual waking times, not ones that would not work in reality), then use Neo4J spatial to decide which tiles to load at runtime, then use a mathematical solution to find the fastest acceptable routes. I'll let you know how I get on! – Dom Weldon Dec 28 '15 at 20:09
  • I had to edit my first comment, but left it >5 mins. The comment was... I used the first section of code, and expanded it a bit to make the query below: `START a = node(5955), b=node(5707) WITH a,b MATCH p=((a)-[r0:IS_WALKABLE_TO]->(i0:Basket)-[r1:IS_WALKABLE_TO]->(i1:Basket)-[r2:IS_WALKAB‌​LE_TO]->(i2:Basket)-[r3:IS_WALKABLE_TO]->(i3:Basket)-[r4:IS_WALKABLE_TO]->(b)) return p as shortestPath, reduce(total = 0, r in relationships(p)|total + r.seconds) as tt ORDER by tt asc LIMIT 10` – Dom Weldon Dec 28 '15 at 20:11