0

I thought I was getting the grasp of neo4j. It turns out I am not. I have a long query I'm running. When I run with any 2 of the optional matches it runs in like 20 seconds. But if I had any third optional match (doesn't seem to matter which one) it will take nearly 15 minutes to run. I can't quite figure it out. I understand (somewhat) that optional matches order matters because they take all of the already matched stuff before them and use those "rows" to check the optional match thus getting exponentially more costly with each one. I thought if I added carefully placed "with" statements between each one, I could try to filter only the things that are necessary into each one.

My optional matches don't really have THAT much to do with each other. I'd actually do it as 3-4 different neo4j queries but my boss wants me to do it all in one query. If it turns out that the performance is drastically better, I might end up defying his wishes. I'm going to give you the full query with a few names of things changed. It won't affect the query or anything, my work is technically open source but I'm still not supposed to share anything identifiable.

I also ran "Profile" to show the full tree.

profile
match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
with ds, firstSample

optional match (ds)<-[*]-(othersample:Sample)
with ds, othersample, firstSample
where not othersample.location is null and not trim(othersample.location) = ''

optional match (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
with ds, othersample, firstSample, source, specialsample

optional match (ds)<-[*]-(finalsample:Sample)
with ds, othersample, firstSample, source, specialsample, finalsample
where not finalsample.metadata is null and not trim(finalsample.metadata) = ''


return ds.id, collect(distinct firstSample), collect(distinct source), collect(distinct othersample), collect(distinct specialsample), ds.alt_id, ds.status, ds.group_name, ds.group_uuid, 
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, ds.last_modified_email, ds.lab_id, ds.data_types, collect(distinct finalsample)

This is hooking into a python script already written so I don't really have flexibility with the outputs or even the order that they are returned, but if necessary I might be able to do something about it.

Any advice would be appreciated. https://i.stack.imgur.com/9psgT.png

Simon Thordal
  • 893
  • 10
  • 28
Derek1st
  • 63
  • 6

1 Answers1

2

A couple of things I would try.

  1. Aggregate early in your query instead of waiting until the end.

  2. Use pattern comprehensions instead of OPTIONAL MATCH where you can.

This might get you started.

match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)

WITH ds, 
collect(distinct firstSample) as firstSamples, 
collect(distinct source) as sources

UNWIND sources as source

OPTIONAL MATCH (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)

WITH ds, 
firstSamples, 
collect(distinct source) AS sources, 
collect(distinct specialsample) AS specialSamples

RETURN ds.id, ds.alt_id, ds.status, ds.group_name, ds.group_uuid, 
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, 
ds.last_modified_email, ds.lab_id, ds.data_types, 
firstSamples, 
sources,
specialSamples,

apoc.coll.toSet([(ds)<-[*]-(othersample:Sample) 
where not othersample.location is null 
and not trim(othersample.location) = '' | othersample]) 
AS otherSamples,

apoc.coll.toSet([(ds)<-[*]-(finalsample:Sample) 
where not finalsample.metadata is null 
and not trim(finalsample.metadata) = '' | finalsample]) 
AS finalSamples

You can see in the profile trace that the finalsamples portion of the query involves lots of rows. Are you sure your logic is correct in what you are asking for there?

Nathan Smith
  • 881
  • 4
  • 6
  • Good stuff. I'm kinda curious about some of the things happening here. Can you tell me what the purpose of collecting sources is, and then immediately unwinding it, and then collecting it again is? Also, it looks like you're using apoc.coll to replace the latter 2 optional matches. Can you expand a little bit about why you are able to do this? – Derek1st Dec 06 '21 at 20:07
  • I only wanted to search for the specialsample pattern once for each combination of source and ds. Using `collect(distinct source)` in line 4 gives us only the unique values of source for each value of ds. We might have found multiple values of specialsample to go along with the source-ds pair, so I used a `collect(distinct source)` again after searching for the specialsamples. (By the way, I noticed an extra colon in the first `count(distinct sample)` that was a typo. I edited to remove it.) – Nathan Smith Dec 07 '21 at 02:11
  • The pattern comprehensions search for all matches for the pattern. In cases where there was more than one path from a ds to an othersample or a finalsample, you would end up with duplicate values of othersample or finalsample. The call to apoc.coll.toSet() removes the duplicates in one line. You could also unwind the list, do distinct, and the collect again. – Nathan Smith Dec 07 '21 at 02:20