I'm struggling to write a fast query that combines multiple predicates using an and() step in Amazon Neptune. The basic graph structure is below and is used for modelling biological data. There are "pathways", which connect to "enzymes", which connect to "reactions", which connect to "compounds". I'm trying to filter the pathways so that only those connected to every compound in a given set are returned, e.g. find the pathways that are connected to both compound 1 and compound 2.
g.addV('pathway').property('name', 'pathway 1').as('p1').
addV('pathway').property('name', 'pathway 2').as('p2').
addV('pathway').property('name', 'pathway 3').as('p3').
addV('enzyme').property('name', 'enzyme 1').as('e1').
addV('enzyme').property('name', 'enzyme 2').as('e2').
addV('enzyme').property('name', 'enzyme 3').as('e3').
addV('reaction').property('name', 'reaction 1').as('r1').
addV('reaction').property('name', 'reaction 2').as('r2').
addV('reaction').property('name', 'reaction 3').as('r3').
addV('compound').property('name', 'compound 1').as('c1').
addV('compound').property('name', 'compound 2').as('c2').
addV('compound').property('name', 'compound 3').as('c3').
addV('compound').property('name', 'compound 4').as('c4').
addV('compound').property('name', 'compound 5').as('c5').
addV('compound').property('name', 'compound 6').as('c6').
addE('contains').from('p1').to('e1').
addE('contains').from('p1').to('e2').
addE('contains').from('p1').to('e3').
addE('contains').from('p2').to('e1').
addE('contains').from('p3').to('e2').
addE('partof').from('e1').to('p1').
addE('partof').from('e2').to('p1').
addE('partof').from('e3').to('p1').
addE('partof').from('e1').to('p2').
addE('partof').from('e2').to('p3').
addE('catalyzes').from('e1').to('r1').
addE('catalyzes').from('e2').to('r2').
addE('catalyzes').from('e3').to('r3').
addE('substrate').from('c1').to('r1').
addE('product').from('r1').to('c2').
addE('substrate').from('c3').to('r2').
addE('product').from('r2').to('c4').
addE('substrate').from('c5').to('r3').
addE('product').from('r3').to('c6')
My current solution is to start at the pathway nodes and use a combination of where() and and() steps to do the filtering:
g.V().hasLabel('pathway').
  where(and(
    out('contains').hasLabel('enzyme').
      out('catalyzes').hasLabel('reaction').
      both().has('compound', 'name', 'compound 6'),
    out('contains').hasLabel('enzyme').
      out('catalyzes').hasLabel('reaction').
      both().has('compound', 'name', 'compound 4')
  )).
  valueMap().toList()
This works fine and allows me to search for any number of compounds but is slow, taking multiple seconds to run the query.
In comparison, if I start at the compound node and traverse to the pathway, it's almost instantaneous, but I don't know how to replicate the multiple predicates from the query above:
g.V().has('compound', 'name', 'compound 6').both().
in('catalyzes').out('partof').hasLabel('pathway').dedup().valueMap().toList()
For this toy dataset both queries are fast, but in my production DB with 1000 pathways, 6000 enzymes, 10000 reactions and 50000 compounds the where()/and() query takes 3-5 seconds to run.
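To pin down the result I'm after, here is a minimal plain-Python sketch of the toy graph. It is not Gremlin and not something I run against Neptune. It starts from each compound, collects the pathways reachable from it, and intersects those sets. The dictionaries and the function names pathways_for and pathways_for_all are made up for this illustration, and the sketch only uses the 'contains', 'catalyzes', 'substrate' and 'product' edges.

```python
# Toy graph from the setup script, reduced to plain dictionaries (illustration only).

# pathway -> enzymes ('contains' edges)
CONTAINS = {
    "pathway 1": {"enzyme 1", "enzyme 2", "enzyme 3"},
    "pathway 2": {"enzyme 1"},
    "pathway 3": {"enzyme 2"},
}

# enzyme -> reactions ('catalyzes' edges)
CATALYZES = {
    "enzyme 1": {"reaction 1"},
    "enzyme 2": {"reaction 2"},
    "enzyme 3": {"reaction 3"},
}

# reaction -> compounds, ignoring edge direction ('substrate' and 'product' edges)
COMPOUNDS = {
    "reaction 1": {"compound 1", "compound 2"},
    "reaction 2": {"compound 3", "compound 4"},
    "reaction 3": {"compound 5", "compound 6"},
}


def pathways_for(compound):
    """Walk compound -> reaction -> enzyme -> pathway and collect the pathways."""
    reactions = {r for r, cs in COMPOUNDS.items() if compound in cs}
    enzymes = {e for e, rs in CATALYZES.items() if rs & reactions}
    return {p for p, es in CONTAINS.items() if es & enzymes}


def pathways_for_all(compounds):
    """Pathways connected to every given compound: intersect the per-compound sets."""
    sets = [pathways_for(c) for c in compounds]
    return set.intersection(*sets) if sets else set()


print(sorted(pathways_for_all(["compound 6", "compound 4"])))  # ['pathway 1']
```

On the toy data, compound 4 reaches pathways 1 and 3 and compound 6 reaches only pathway 1, so the intersection is pathway 1. That is the same answer the where()/and() query gives, and it is what I'd like to get from a traversal that starts at the compounds.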
Is there an alternative in Amazon Neptune to the where()-and() pattern I'm using for filtering on multiple predicates that might give better performance?