The query starts at the vertex 'me'. I want to find all A-vertices that are connected both to my B-vertex and to one of my C-vertices.
A person (like me) is always connected to exactly one B-vertex, but to several C-vertices. The C-vertices connected to me are in turn connected to possibly hundreds of A-vertices, whereas my B-vertex is usually connected to fewer than 50 A-vertices.
=> There are far more C-A edges than B-A edges.

I developed two traversals to find all A-vertices connected to my B-vertex and C-vertices.

To make them easier to understand, let's call the A-vertices connected to me via the B-vertex 'Ab' and the A-vertices connected to me via a C-vertex 'Ac'. Of course the two sets have an intersection, which is exactly what I'm after.

The first traversal uses the intersection of the Ac- and Ab-vertices. First it collects all Ab and labels them with 'as()', then it collects all Ac and keeps only those equal to an Ab-vertex.

g.V('me').out('mb').out('ba').as('Ab')
    .V('me').out('mc').out('ca').as('Ac')
    .where(eq('Ab'))

Variation with aggregate():

g.V('me').out('mb').out('ba').aggregate('Ab')
    .V('me').out('mc').out('ca')
    .where(within('Ab')).dedup()

The second uses a filter. It first collects all Ab-vertices (as this is the smaller of the two sets) and then uses filter() to keep only those Ab-vertices that are also connected to me via a C-vertex.

g.V('me').out('mb').out('ba')
    .filter(
        __.in('ca').in('mc').hasId('me')
    )

In my estimation, the second should be more efficient, because it traverses a smaller section of the graph.
Am I right in this assumption? Is there a more efficient approach?

My second problem relates to the sack operator. I wish to sort the resulting set of A-vertices by the strength of the C-Path.

The first query is capable of doing that.

g.withSack(1.0f)
    .V('me').out('mb').out('ba').as('Ab')
    
    .V('me')
    .outE('mc').has('weight').sack(mult).by('weight')
    .inV().hasLabel('C')
    .outE('ca').has('weight').sack(mult).by('weight')
    .inV().hasLabel('A')
    .as('Ac')
    
    .where(eq('Ab'))
    .group().by().by(sack().sum())
    .unfold()
    .order().by(values, desc)

Is there a way to get the me-C-A sack value in the second query as well? My only guess would be to turn the second query around: first find all Ac-vertices, note the sack values, then remove those that are not part of Ab. But this would traverse a huge part of the graph. As I said above: the set of Ac-vertices numbers several hundred, whereas there are fewer than fifty Ab-vertices.

Data:

g.addV('person').property(id, 'me')
  .addV('A').property(id, 'a1')
  .addV('A').property(id, 'a2')
  .addV('A').property(id, 'a3')
  .addV('A').property(id, 'a4')
  .addV('B').property(id, 'b')
  .addV('C').property(id, 'c1')
  .addV('C').property(id, 'c2')
  .addE('mc').property(id, 'mc1').property('weight', 0.5).from(V('me')).to(V('c1'))
  .addE('mc').property(id, 'mc2').property('weight', 0.6).from(V('me')).to(V('c2'))
  .addE('mb').property(id, 'mb').from(V('me')).to(V('b'))
  .addE('ba').property(id, 'ba1').from(V('b')).to(V('a2'))
  .addE('ba').property(id, 'ba2').from(V('b')).to(V('a3'))
  .addE('ba').property(id, 'ba3').from(V('b')).to(V('a4'))
  .addE('ca').property(id, 'ca1').property('weight', 0.5).from(V('c1')).to(V('a1'))
  .addE('ca').property(id, 'ca2').property('weight', 0.7).from(V('c2')).to(V('a2'))
  .addE('ca').property(id, 'ca3').property('weight', 0.4).from(V('c2')).to(V('a3'))
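
As a sanity check on this sample data, the expected result can be worked out by hand. The sketch below is a plain Python model of the edges listed above (not Gremlin, and nothing Neptune-specific); the names MC, BA, CA and weighted_intersection are made up for illustration. It follows the shape of the first query: multiply the weights along each me-C-A path, keep only A-vertices that are also in Ab, and sum per vertex.

```python
# Plain-Python model of the sample graph above; all names are illustrative only.
MC = {"c1": 0.5, "c2": 0.6}      # weight of each me -> C edge, keyed by C
BA = ["a2", "a3", "a4"]          # A-vertices reachable via my single B-vertex (Ab)
CA = [("c1", "a1", 0.5), ("c2", "a2", 0.7), ("c2", "a3", 0.4)]  # C -> A edges


def weighted_intersection():
    """Return (A id, summed me-C-A path weight) for vertices in both Ab and Ac."""
    ab = set(BA)
    totals = {}
    for c, a, w_ca in CA:
        if c in MC and a in ab:
            # product along one me -> C -> A path, summed over all such paths
            totals[a] = totals.get(a, 0.0) + MC[c] * w_ca
    # strongest C-path first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for a, weight in weighted_intersection():
        print(a, round(weight, 2))
```

On this data only a2 and a3 are in both sets (a1 is reachable only via C, a4 only via B), with path weights of about 0.6 * 0.7 = 0.42 and 0.6 * 0.4 = 0.24.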

(my code runs on Neptune with gremlin: {'version': 'tinkerpop-3.4.11'})

Meike

1 Answer

There is a third option, which may be able to parallelize retrieving B and C (depending on the TinkerPop implementation) and which does not retrieve all A-vertices:

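// within() is a predicate, not a step, so it is wrapped in where(); eq('C') matches the C labelled on the path
g.V('me').out('mc').as('C')
  .V('me').out('mb').out('ba')
  .where(__.in('ca').where(eq('C')))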

For the sack multiplication you can traverse back very fast from A to 'me', because the vertices are already in the cache of the graph system.
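
One way to read this suggestion, sketched in plain Python rather than Gremlin: start from the small Ab set and walk the incoming 'ca' and 'mc' edges back to 'me', multiplying the weights on the way. The edge lists mirror the sample data from the question; the index ca_in and the function name rank_ab_by_c_path are hypothetical. Whether a particular graph engine really serves these lookups from a cache is an assumption this sketch cannot show.

```python
from collections import defaultdict

# Same sample edges as in the question; variable names are illustrative only.
MC = {"c1": 0.5, "c2": 0.6}      # weight of each me -> C edge, keyed by C
BA = ["a2", "a3", "a4"]          # A-vertices reachable via my B-vertex (Ab)
CA = [("c1", "a1", 0.5), ("c2", "a2", 0.7), ("c2", "a3", 0.4)]  # C -> A edges

# Index incoming 'ca' edges by A, so each Ab-vertex only looks at its own in-edges.
ca_in = defaultdict(list)
for c, a, w in CA:
    ca_in[a].append((c, w))


def rank_ab_by_c_path():
    """Walk A -> C -> me for each Ab-vertex and sum the weight products."""
    totals = {}
    for a in BA:
        # one product per path A <- C <- me; C-vertices not linked to me are skipped
        paths = [MC[c] * w for c, w in ca_in.get(a, []) if c in MC]
        if paths:  # keep only Ab-vertices that are also reachable via a C-vertex
            totals[a] = sum(paths)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for a, weight in rank_ab_by_c_path():
        print(a, round(weight, 2))
```

The loop only touches the in-edges of the Ab-vertices, so on a graph shaped like the one in the question it never visits A-vertices that are reachable via C alone.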

HadoopMarc