I thought I was getting the hang of Neo4j. It turns out I am not. I have a long query that runs in about 20 seconds with any two of its optional matches. But if I add a third optional match (it doesn't seem to matter which one), it takes nearly 15 minutes to run. I can't quite figure it out. I understand (somewhat) that the order of optional matches matters: each one takes all of the "rows" already matched before it and checks the optional pattern against every one of them, so the cost multiplies with each additional match. I thought that by adding carefully placed "with" statements between them, I could pass only the necessary things into each one.
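To make the row-multiplication idea concrete, here is a toy Python sketch of how I understand it. It is not my real data and not how Neo4j is implemented: the match counts (50, 40, 50) and the helper names optional_match and collect_last are made up for illustration. It imitates each optional match expanding every incoming row, and a "with ... collect(...)" between matches collapsing the rows again.

```python
def optional_match(rows, matches):
    """Toy OPTIONAL MATCH: expand each incoming row once per match,
    or keep it once with None when nothing matches."""
    out = []
    for row in rows:
        for m in (matches or [None]):
            out.append(row + (m,))
    return out


def collect_last(rows):
    """Toy `WITH <keys>, collect(x)`: group rows by every column except the last."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[:-1], []).append(row[-1])
    return [key + (tuple(values),) for key, values in grouped.items()]


# Hypothetical match counts for a single ds node.
others = [f"other{i}" for i in range(50)]
specials = [f"special{i}" for i in range(40)]
finals = [f"final{i}" for i in range(50)]

# Chained optional matches: every new match runs once per existing row.
naive = [("ds1",)]
for matches in (others, specials, finals):
    naive = optional_match(naive, matches)

# Collapsing after each optional match keeps one row per ds.
collapsed = [("ds1",)]
for matches in (others, specials, finals):
    collapsed = collect_last(optional_match(collapsed, matches))

print(len(naive), len(collapsed))
```

In this toy version the chained form ends with 50 * 40 * 50 rows for one ds, while the collapsed form never carries more than one row into the next optional match.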
My optional matches don't really have THAT much to do with each other. I'd actually rather run them as 3-4 separate Neo4j queries, but my boss wants it all done in one query. If it turns out that separate queries perform drastically better, I might end up defying his wishes. Below is the full query with a few names changed. The renaming doesn't affect the query; my work is technically open source, but I'm still not supposed to share anything identifiable.
I also ran "Profile" to show the full tree.
profile
match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
with ds, firstSample
optional match (ds)<-[*]-(othersample:Sample)
with ds, othersample, firstSample
where not othersample.location is null and not trim(othersample.location) = ''
optional match (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
with ds, othersample, firstSample, source, specialsample
optional match (ds)<-[*]-(finalsample:Sample)
with ds, othersample, firstSample, source, specialsample, finalsample
where not finalsample.metadata is null and not trim(finalsample.metadata) = ''
return ds.id, collect(distinct firstSample), collect(distinct source), collect(distinct othersample), collect(distinct specialsample), ds.alt_id, ds.status, ds.group_name, ds.group_uuid,
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, ds.last_modified_email, ds.lab_id, ds.data_types, collect(distinct finalsample)
This query feeds an existing Python script, so I don't really have flexibility with the outputs or even the order in which they are returned, although I might be able to change that if necessary.
Any advice would be appreciated. Profile output: https://i.stack.imgur.com/9psgT.png