0

I have a Neo4j query with searched multiple entities and I would like to pass parameters in batch using nodes object. However, I the speed of query execution is not quite high. How can I optimize this query and make its performance better?

WITH $nodes as nodes
    UNWIND nodes AS node
      with node.id AS id, node.lon AS lon, node.lat AS lat
    MATCH 
    (m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic),
    (t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->
    (e:Event)-[a_r:AT]->(v:Venue) 
    WHERE mt.topic_id = gt.topic_id AND 
    distance(point({ longitude: lon, latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) < 4000 AND 
    mt.member_id = id
RETURN 
    distinct id as member_id, 
    lat as member_lat, 
    lon as member_lon, 
    g.group_name as group_name, 
    e.event_name as event_name, 
    v.venue_name as venue_name, 
    v.lat as venue_lat, 
    v.lon as venue_lon, 
    distance(point({ longitude: lon, 
    latitude: lat}),point({ longitude: v.lon, latitude: v.lat })) as distance

Query profiling looks like this:

enter image description here

Cassie
  • 2,941
  • 8
  • 44
  • 92
  • What are your indexes? One immediate suggestion is to index `mt.member_id`, and then do that match first, then put that into a with clause and match the rest next. This would cut down on a fair amount of work. The "NodeByLabelScan" for 30 million is probably what's hurting. Also, why are you matching an ID in one node to an ID in another? Why are there not relationships between these nodes? If you could create those it would get faster too. – FrobberOfBits Jul 20 '18 at 12:23
  • 1
    is t and t1 suppose to be the same Topic node? If so, why do you slit the query there? Also putting an index on `(mt:MemberTopics) mt.id` and inlining the id match for the Planner `(mt:MemberTopics {id:id})`, would reduce that part from 30mill db hits to ~1db hit. Also, can you expand the other boxes as well so that we can see what they are doing? – Tezra Jul 20 '18 at 13:30
  • 1
    After matching MemberTopics, the distance search is probably the next most expensive part, but once the planner can get the MemberTopic in 1db hit, the distance lookup should be much faster (as it won't try to do it in parallel, but only on the nodes mt is actually connected to) – Tezra Jul 20 '18 at 13:30
  • @Tezra T hanks! Creation of the index and matching on id really helped with the performance! Could you post it as an answer? – Cassie Jul 20 '18 at 14:01
  • @Tezra Also, I remove the split in a query and it started to work much faster. I used the split in the first place because without it the query didn't return any results – Cassie Jul 20 '18 at 14:02

1 Answers1

1

So, your current plan has 3 parallel threads. One we can ignore for now because it has 0db hits.

The biggest hit you are taking is the match for (mt:MemberTopics) ... WHERE mt.member_id = id. I'm guessing member_id is a unique id, so you will want to create an index on it CREATE INDEX ON :MemberTopics(member_id). That will allow Cypher to do an index lookup instead of a node scan, which will reduce the DB hits from ~30mill to ~1 (Also, in some cases, in-lining property matches is faster for more complex queries. So (mt:MemberTopics {member_id:id}) is better. It explicitly makes clear that this condition must always be true while matching, and will reinforce to use the index lookup)

The second biggest hit is the point-distance check. Right now, this is being done independently, because the node scan takes so long. Once you make the changes for MemberTopic, The planner should switch to finding all connected Venues, and then only doing the distance check on thous, so that should become cheaper as well.

Also, it looks like mt and gt are linked by a topic, and you are using a topic id to align them. If t and t1 are suppose to be the same Topic node, you could just use t for both nodes to enforce that, and then you don't need to do the id check to link mt and gt. If t and t1 are not the same node, the use of a foriegn key in your node's properties is a sign that you should have a relationship between the two nodes, and just travel along that edge (Relationships can have properties too, but the context looks a lot like t and t1 are suppose to be the same node. You can also enforce this by saying WHERE t = t1, but at that point, you should just use t for both nodes)

Lastly, Depending on the number of rows your query returns, you may want to use LIMIT and SKIP to page your results. This looks like info going to a user, and I doubt they need the full dump. So Only return the top results, and only process the rest if the user wants to see more. (Useful as results approach a metric ton) Since you only have 21 results so far, this won't be an issue right now, but keep in mind as you need to scale to 100,000+ results.

Tezra
  • 8,463
  • 3
  • 31
  • 68