'set intersection' vs. 'has path to' and Sack()

Question

The query starts at the vertex 'me'. I wish to find all A-vertices that are connected to both my B-vertex and at least one of my C-vertices.
A person (like me) is always connected to exactly one B-vertex, but to several C-vertices. The C-vertices connected to me are in turn connected to possibly hundreds of A-vertices, whereas my B-vertex is usually connected to fewer than 50 A-vertices.
=> There are far more C-A edges than B-A edges.

I developed two traversals to find all A-vertices connected to my B-vertex and C-vertices.

To make them easier to understand, let's call the A-vertices connected to me via the B-vertex 'Ab' and the A-vertices connected to me via a C-vertex 'Ac'. Of course the two sets have an intersection, which is exactly what I'm after.

The first traversal uses the set intersection of the Ac- and Ab-vertices. First it collects all Ab and labels them with 'as()', then it collects all Ac and keeps only those equal to an Ab-vertex.

g.V('me').out('mb').out('ba').as('Ab')
    .V('me').out('mc').out('ca').as('Ac')
    .where(eq('Ab'))

variation with aggregate:

g.V('me').out('mb').out('ba').aggregate('Ab')
    .V('me').out('mc').out('ca')
    .where(within('Ab')).dedup()

The second uses a filter. It first collects all Ab-vertices (as this is the smaller of the two sets) and then uses filter() to keep only those Ab-vertices that are also connected to me via a C-vertex.

g.V('me').out('mb').out('ba')
    .filter(
        __.in('ca').in('mc').hasId('me')
    )

In my estimation, the second should be more efficient, because it traverses a smaller section of the graph.
Am I right in this assumption? Is there a more efficient approach?
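
One rough way to reason about the two strategies is to count how many edges each one touches. The sketch below is plain Python over the sample data from this question, not Gremlin, and it says nothing about how Neptune actually plans or executes a traversal. The names `OUT`, `IN`, `intersect_strategy` and `filter_strategy` are made up for illustration.

```python
# Sample graph from the question as adjacency lists: (vertex, edge label) -> targets.
OUT = {
    ("me", "mb"): ["b"],
    ("me", "mc"): ["c1", "c2"],
    ("b", "ba"): ["a2", "a3", "a4"],
    ("c1", "ca"): ["a1"],
    ("c2", "ca"): ["a2", "a3"],
}

# Reverse index so that incoming edges can be followed as well.
IN = {}
for (src, label), targets in OUT.items():
    for dst in targets:
        IN.setdefault((dst, label), []).append(src)


def intersect_strategy(start):
    """Collect Ab and Ac separately, then intersect. Returns (result, edges touched)."""
    touched = 0
    ab, ac = set(), set()
    for b in OUT.get((start, "mb"), []):
        touched += 1
        for a in OUT.get((b, "ba"), []):
            touched += 1
            ab.add(a)
    for c in OUT.get((start, "mc"), []):
        touched += 1
        for a in OUT.get((c, "ca"), []):
            touched += 1
            ac.add(a)
    return ab & ac, touched


def filter_strategy(start):
    """Collect Ab, then keep each vertex with a ca/mc path back to start."""
    touched = 0
    result = set()
    for b in OUT.get((start, "mb"), []):
        touched += 1
        for a in OUT.get((b, "ba"), []):
            touched += 1
            found = False
            for c in IN.get((a, "ca"), []):
                touched += 1
                for m in IN.get((c, "mc"), []):
                    touched += 1
                    if m == start:
                        found = True
                        break
                if found:
                    break  # one connecting path is enough for a filter
            if found:
                result.add(a)
    return result, touched


print(intersect_strategy("me"))
print(filter_strategy("me"))
```

Both functions return the same set of A-vertices. On this tiny sample the edge counts are nearly equal; the intersect variant grows with the total number of C-A edges, while the filter variant grows with the number of incoming 'ca' edges on the Ab-vertices only. If A-vertices have many incoming 'ca' edges from C-vertices that belong to other people, the filter variant pays for those too, so profiling on real data is still the safer guide.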

My second problem relates to the sack operator. I wish to sort the resulting set of A-vertices by the strength of the C-Path.

The first query is capable of doing that.

g.withSack(1.0f)
    .V('me').out('mb').out('ba').as('Ab')
    
    .V('me')
    .outE('mc').has('weight').sack(mult).by('weight')
    .inV().hasLabel('C')
    .outE('ca').has('weight').sack(mult).by('weight')
    .inV().hasLabel('A')
    .as('Ac')
    
    .where(eq('Ab'))
    .group().by().by(sack().sum())
    .unfold()
    .order().by(values, desc)

Is there a way to get the me-C-A sack value in the second query as well? My only guess would be to turn the second query around: first find all Ac-vertices, note the sack values, then remove those not part of Ab. But this would traverse a huge part of the graph. As I said above: the set of Ac-vertices numbers several hundred, whereas there are fewer than fifty Ab-vertices.

Data:

g.addV('person').property(id, 'me')
  .addV('A').property(id, 'a1')
  .addV('A').property(id, 'a2')
  .addV('A').property(id, 'a3')
  .addV('A').property(id, 'a4')
  .addV('B').property(id, 'b')
  .addV('C').property(id, 'c1')
  .addV('C').property(id, 'c2')
  .addE('mc').property(id, 'mc1').property('weight', 0.5).from(V('me')).to(V('c1'))
  .addE('mc').property(id, 'mc2').property('weight', 0.6).from(V('me')).to(V('c2'))
  .addE('mb').property(id, 'mb').from(V('me')).to(V('b'))
  .addE('ba').property(id, 'ba1').from(V('b')).to(V('a2'))
  .addE('ba').property(id, 'ba2').from(V('b')).to(V('a3'))
  .addE('ba').property(id, 'ba3').from(V('b')).to(V('a4'))
  .addE('ca').property(id, 'ca1').property('weight', 0.5).from(V('c1')).to(V('a1'))
  .addE('ca').property(id, 'ca2').property('weight', 0.7).from(V('c2')).to(V('a2'))
  .addE('ca').property(id, 'ca3').property('weight', 0.4).from(V('c2')).to(V('a3'))

(my code runs on Neptune with gremlin: {'version': 'tinkerpop-3.4.11'})

An answer to one of the two questions would already help :) – Meike, Feb 25 '22 at 11:29

score 1 · Answer 1 · answered Feb 27 '22 at 10:58

There is a third option, which can parallelize retrieving B and C (depending on the TinkerPop implementation) and does not retrieve all A-vertices:

g.V('me').out('mc').as('C')
  .V('me').out('mb').out('ba')
  .where(__.in('ca').as('C'))

For the sack multiplication you can traverse back very fast from A to 'me', because the vertices are already in the cache of the graph system.
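
The walk back from A to 'me' can be sketched on the sample data as follows. This is plain Python rather than Gremlin, meant only to show the arithmetic: start each path at 1.0, multiply by the 'ca' and 'mc' weights, and sum over all connecting C-vertices, which mirrors what the question's sack(mult) and sack().sum() steps compute. `WEIGHT`, `AB` and `path_strength` are hypothetical names. In Gremlin this would presumably correspond to following inE('ca') and inE('mc') from each Ab-vertex while applying sack(mult).by('weight').

```python
# Weighted edges from the question's sample data: (source, label, target) -> weight.
WEIGHT = {
    ("me", "mc", "c1"): 0.5,
    ("me", "mc", "c2"): 0.6,
    ("c1", "ca", "a1"): 0.5,
    ("c2", "ca", "a2"): 0.7,
    ("c2", "ca", "a3"): 0.4,
}

# A-vertices reached via me -> b -> A in the sample data.
AB = ["a2", "a3", "a4"]


def path_strength(start, ab_vertices):
    """Score each Ab vertex by the summed weight product of its paths back to start."""
    scores = {}
    for a in ab_vertices:
        for (c, label, dst), w_ca in WEIGHT.items():
            if label != "ca" or dst != a:
                continue  # only incoming 'ca' edges of this A-vertex
            w_mc = WEIGHT.get((start, "mc", c))
            if w_mc is None:
                continue  # this C-vertex is not one of start's C-vertices
            # Initial sack value 1.0, multiplied along the path, summed per vertex.
            scores[a] = scores.get(a, 0.0) + 1.0 * w_mc * w_ca
    # Strongest path first; vertices without a connecting path are left out.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(path_strength("me", AB))
```

On the sample data this ranks a2 (about 0.42) ahead of a3 (about 0.24) and drops a4, which has no 'ca' edge. Only the incoming edges of the Ab-vertices are visited, not the full Ac set.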
