Alternative to combining where & and steps in gremlin query

Question

I'm struggling to write a fast query in Amazon Neptune that applies multiple predicates using an and step. The basic graph structure, used for modelling biological data, is shown below: "pathways" connect to "enzymes", which connect to "reactions", which connect to "compounds". I'm trying to filter the pathways so that only those connected to all of a given set of compounds are returned, e.g. find the pathways that are connected to both compound 1 and compound 2.

g.addV('pathway').property('name', 'pathway 1').as('p1').
  addV('pathway').property('name', 'pathway 2').as('p2').
  addV('pathway').property('name', 'pathway 3').as('p3').
  addV('enzyme').property('name', 'enzyme 1').as('e1').
  addV('enzyme').property('name', 'enzyme 2').as('e2').
  addV('enzyme').property('name', 'enzyme 3').as('e3').
  addV('reaction').property('name', 'reaction 1').as('r1').
  addV('reaction').property('name', 'reaction 2').as('r2').
  addV('reaction').property('name', 'reaction 3').as('r3').
  addV('compound').property('name', 'compound 1').as('c1').
  addV('compound').property('name', 'compound 2').as('c2').
  addV('compound').property('name', 'compound 3').as('c3').
  addV('compound').property('name', 'compound 4').as('c4').
  addV('compound').property('name', 'compound 5').as('c5').
  addV('compound').property('name', 'compound 6').as('c6').
  addE('contains').from('p1').to('e1').
  addE('contains').from('p1').to('e2').
  addE('contains').from('p1').to('e3').
  addE('contains').from('p2').to('e1').
  addE('contains').from('p3').to('e2').
  addE('partof').from('e1').to('p1').
  addE('partof').from('e2').to('p1').
  addE('partof').from('e3').to('p1').
  addE('partof').from('e1').to('p2').
  addE('partof').from('e2').to('p3').
  addE('catalyzes').from('e1').to('r1').
  addE('catalyzes').from('e2').to('r2').
  addE('catalyzes').from('e3').to('r3').
  addE('substrate').from('c1').to('r1').
  addE('product').from('r1').to('c2').
  addE('substrate').from('c3').to('r2').
  addE('product').from('r2').to('c4').
  addE('substrate').from('c5').to('r3').
  addE('product').from('r3').to('c6')

My current solution is to start at the pathway nodes and use a combination of where and and steps to do the filtering:

g.V().hasLabel('pathway').where(and(
  out('contains').hasLabel('enzyme').
    out('catalyzes').hasLabel('reaction').both().has('compound', 'name', 'compound 6'),
  out('contains').hasLabel('enzyme').
    out('catalyzes').hasLabel('reaction').both().has('compound', 'name', 'compound 4')
  )
).valueMap().toList()

This works fine and allows me to search for any number of compounds but is slow, taking multiple seconds to run the query.

In comparison if I start at the compound node and traverse to the pathway, it's almost instantaneous, but I don't know how to replicate the multiple predicates like above:

g.V().has('compound', 'name', 'compound 6').both().
  in('catalyzes').out('partof').hasLabel('pathway').dedup().valueMap().toList()

For this toy dataset both queries are fast, but in my production DB with 1000 pathways, 6000 enzymes, 10000 reactions and 50000 compounds the where-and query can take 3-5 seconds to run.

Is there an alternative in Amazon Neptune to the where-and pattern I'm using for filtering on multiple predicates that might give better performance?

score 1 · Answer 1 · answered Jul 08 '20 at 07:25

Since the anonymous traversals inside the and step are basically the same, you can replace them with a single traversal that uses within and then count the distinct values:

g.V().hasLabel('pathway').where(
  out('contains').hasLabel('enzyme').
    out('catalyzes').hasLabel('reaction').
    both().has('compound', 'name', within('compound 6', 'compound 4')).
    values('name').dedup().count().is(2)
  ).valueMap()

example: https://gremlify.com/c78cabauv7q
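
The same pattern extends to any number of compounds if the expected count is taken from the list itself. The following is only a sketch: names is a hypothetical variable, and it assumes a Groovy-based environment such as the Gremlin Console with a local g. When sending queries to Neptune you would more likely build the list in your client language and pass the values in. It also assumes the names in the list are distinct, otherwise the count can never match.

```groovy
// Hypothetical variable: the compound names every returned pathway must reach.
// The names are assumed to be distinct.
names = ['compound 6', 'compound 4']

g.V().hasLabel('pathway').
  where(
    out('contains').hasLabel('enzyme').
      out('catalyzes').hasLabel('reaction').
      both().has('compound', 'name', within(names)).
      // Count the distinct matching names and require all of them.
      values('name').dedup().
      count().is(names.size())
  ).valueMap()
```

On the sample graph from the question this should keep only pathway 1, since it is the only pathway that reaches both compound 6 and compound 4.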

If you get better performance starting from the "compound" vertices, you can try something like this:

g.V().
  has('compound', 'name', within('compound 6', 'compound 4')).as('compound').
  both().in('catalyzes').in('contains').hasLabel('pathway').
  group().
    by().
    by(select('compound').values('name').dedup().count()).
  unfold().
  where(select(values).is(2)).select(keys).
  valueMap()

example: https://gremlify.com/c78cabauv7q/1
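
To see what the count in the query above is based on, it can help to look at the grouped map before it is filtered. This is a debugging sketch rather than a replacement: it keys the map by pathway name and folds the distinct compound names instead of counting them.

```groovy
g.V().
  has('compound', 'name', within('compound 6', 'compound 4')).as('compound').
  both().in('catalyzes').in('contains').hasLabel('pathway').
  group().
    // Key: the pathway's name (instead of the whole vertex).
    by('name').
    // Value: the distinct requested compounds that reached this pathway.
    by(select('compound').values('name').dedup().fold())
```

On the sample graph, pathway 1 should map to both compound names and pathway 3 only to compound 4, which is why filtering on a count of 2 leaves pathway 1.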
