
I want to find the nodes that should be linked to a given node, where a link is defined by the following logic over the nodes' attributes and the existing edges' attributes:

A) The pair has the same zip (node attribute) and name_similarity (edge attribute) > 0.3, OR

B) The pair has a different zip and name_similarity > 0.5, OR

C) The pair has an edge of type "external_info" with value = "connect"

D) AND the pair doesn't have an edge of type "external_info" with value = "disconnect"

In short: (A | B | C) & (~D)

I'm still new to Gremlin, so I'm not sure how to combine several conditions on edges and nodes.

Below is the code for creating the graph, as well as the expected results for that graph:

# creating nodes

(g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate())

node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()

# creating name similarity edges

g.V(node1).addE('name_similarity').from_(node1).to(node2).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node3).property('score', 0.2).next() # under threshold
g.V(node1).addE('name_similarity').from_(node1).to(node4).property('score', 0.4).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node5).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node6).property('score', 0).next() # under threshold

# creating external output edges

g.V(node1).addE('external_info').from_(node1).to(node5).property('decision', 'connect').next() 
g.V(node1).addE('external_info').from_(node1).to(node6).property('decision', 'disconnect').next() 

For input node A, the expected output is nodes B (due to condition A), D (due to condition B), and F (due to condition C). Node E should not be linked, due to condition D.

I'm looking for a Gremlin query that will retrieve these results.

DvirNa

1 Answer


Something seemed wrong in your data given the output you expected, so I had to make two corrections:

  • Vertex D wouldn't appear in the results because "score" was less than 0.5
  • "external_info" edges seemed reversed

Here's the data I used:

g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate()
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
g.V(node1).addE('name_similarity').from(node1).to(node2).property('score', 1).next() 
g.V(node1).addE('name_similarity').from(node1).to(node3).property('score', 0.2).next() 
g.V(node1).addE('name_similarity').from(node1).to(node4).property('score', 0.6).next() 
g.V(node1).addE('name_similarity').from(node1).to(node5).property('score', 1).next() 
g.V(node1).addE('name_similarity').from(node1).to(node6).property('score', 0).next() 
g.V(node1).addE('external_info').from(node1).to(node6).property('decision', 'connect').next() 
g.V(node1).addE('external_info').from(node1).to(node5).property('decision', 'disconnect').next() 

I went with the following approach:

gremlin> g.V().has('person','name','A').as('a').
......1>   V().as('b').
......2>   where('a',neq('b')).
......3>   or(where('a',eq('b')).                                                    // A
......4>        by('zip').
......5>      bothE('name_similarity').has('score',gt(0.3)).otherV().where(eq('a')), 
......6>      bothE('name_similarity').has('score',gt(0.5)).otherV().where(eq('a')), // B
......7>      bothE('external_info').                                                // C
......8>        has('decision','connect').otherV().where(eq('a'))).
......9>   filter(__.not(bothE('external_info').                                     // D
.....10>                 has('decision','disconnect').otherV().where(eq('a')))).
.....11>   select('a','b').
.....12>    by('name')
==>[a:A,b:B]
==>[a:A,b:D]
==>[a:A,b:F]

I think this contains all the logic you were looking for, but I didn't spend much time optimizing it, because I don't think any optimization will get around the pain of the full graph scan of V().as('b'). So either your situation involves a relatively small graph (in-memory, perhaps) and this query will work, or you will need to find another method altogether. Perhaps you have ways to further limit "b", which might help? If something along those lines is possible, I'd probably also try to better define the directionality of the edge traversals, avoiding bothE() in favor of outE() or inE(), which would get rid of otherV(). Hopefully you use a graph that allows for vertex-centric indices, which would speed up those edge lookups on "score" as well (I'm not sure that would help much on "decision", as it has low selectivity).
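As a cross-check that does not depend on Gremlin, the (A | B | C) & ~D rule can be evaluated in plain Python over the corrected data above. This is only a sketch under assumptions: the dictionaries `zips`, `scores`, and `decisions` and the functions `should_link` and `linked_to` are hypothetical names, not part of any Gremlin API. Edges are treated as undirected (mirroring bothE()), and each pair is assumed to carry at most one edge of each type. Unlike the traversal, where the B branch relies on A already covering same-zip pairs, the sketch spells out the zip test in B.

```python
# Corrected sample data from the answer, held in plain dicts (hypothetical layout).
zips = {"A": "123", "B": "123", "C": "456", "D": "456", "E": "123", "F": "999"}

# Edges are keyed by an unordered pair so direction is ignored, like bothE().
scores = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 0.2,
    frozenset(("A", "D")): 0.6,
    frozenset(("A", "E")): 1.0,
    frozenset(("A", "F")): 0.0,
}
decisions = {
    frozenset(("A", "F")): "connect",
    frozenset(("A", "E")): "disconnect",
}


def should_link(a, b):
    """Return True when (A | B | C) & ~D holds for the pair (a, b)."""
    pair = frozenset((a, b))
    score = scores.get(pair)        # None when there is no name_similarity edge
    decision = decisions.get(pair)  # None when there is no external_info edge
    same_zip = zips[a] == zips[b]

    cond_a = score is not None and same_zip and score > 0.3
    cond_b = score is not None and not same_zip and score > 0.5
    cond_c = decision == "connect"
    cond_d = decision == "disconnect"
    return (cond_a or cond_b or cond_c) and not cond_d


def linked_to(node):
    """Sorted names of all other nodes that should be linked to `node`."""
    return sorted(o for o in zips if o != node and should_link(node, o))


print(linked_to("A"))  # ['B', 'D', 'F']
```

The result matches the three rows returned by the traversal, which makes a function like this usable as a small regression check when the traversal is later restructured.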

stephen mallette
  • Thank you Stephen! 1) To answer your question, I intend to use it in the Neptune graph DB. This is the first time I've worked with a graph DB, so I'm not sure how "small" is defined. I expect there to be <100K nodes and some hundreds of thousands of edges. Does this qualify as small, and if not, what does? 2) How could I use this traversal to find ALL nodes that are both directly and *indirectly* linked to node X (by the same defining logic of the connection)? – DvirNa Aug 09 '20 at 10:56
  • (number of hops is unknown and unlimited) – DvirNa Aug 09 '20 at 14:54
  • I tend to think of "small" as something that can fit in memory easily. I'd say hundreds of thousands of edges could be just beyond small for all but the largest machines with a ton of memory. As for your second question, marked (2), I'm not sure what you mean by "directly and indirectly linked to node X". – stephen mallette Aug 09 '20 at 16:56
  • Assume a new graph where node A is connected to node B (say by condition B, name similarity = 0.9) and node B is connected to node C (say by condition C, external_info), but there is no edge between nodes A and C, meaning A - B - C. How can I use your traversal so that, given node A, I get both nodes B and C (and all the nodes that are connected to B and C, if any exist)? Thanks – DvirNa Aug 10 '20 at 06:32
  • If I had concerns about this traversal's performance earlier, adding an unlimited number of hops for this same sort of similarity logic makes me even less confident. I have two suggestions: (1) validate that what I've provided performs acceptably for the single hop on your production-size graph, because adding loops that traverse deeper will only increase the cost further and make the traversal even more unreadable. If possible, improve its performance as it is, without "indirect" relationships. – stephen mallette Aug 10 '20 at 11:03
  • (2) If (1) fails, then proceeding even further down this path will make even less sense, and you should consider a different approach to your similarity matching. It might mean running a separate traversal to add edges between A-C in your example and then running the optimized traversal I provided. It might mean a different model for your graph, or perhaps a different division of work for how the matching occurs (i.e. Gremlin for some aspects and in-memory custom code for other parts). – stephen mallette Aug 10 '20 at 11:06
  • It sounds like you are doing some kind of entity resolution. You might consider this book which discusses graph modelling techniques, Gremlin, and has a couple chapters on that topic: https://www.oreilly.com/library/view/the-practitioners-guide/9781492044062/ – stephen mallette Aug 10 '20 at 11:09
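For the indirect (multi-hop) case raised in the comments, one possible division of work is to fetch the edges once and follow the derived links in memory. The sketch below is an illustration under assumptions, not a Neptune or Gremlin recipe: the zip values, the extra unconnected node D, and the names `should_link`, `neighbours`, and `linked_component` are all hypothetical. It runs a breadth-first search over the A - B - C example, using the same (A | B | C) & ~D rule for each hop.

```python
from collections import deque

# Hypothetical data for the A - B - C example; D has no edges at all.
zips = {"A": "111", "B": "222", "C": "333", "D": "444"}
scores = {frozenset(("A", "B")): 0.9}           # condition B: different zip, > 0.5
decisions = {frozenset(("B", "C")): "connect"}  # condition C


def should_link(a, b):
    """Return True when (A | B | C) & ~D holds for the pair (a, b)."""
    pair = frozenset((a, b))
    score = scores.get(pair)
    decision = decisions.get(pair)
    same_zip = zips[a] == zips[b]
    cond_a = score is not None and same_zip and score > 0.3
    cond_b = score is not None and not same_zip and score > 0.5
    return (cond_a or cond_b or decision == "connect") and decision != "disconnect"


def neighbours(node):
    """Nodes sharing at least one edge with `node`; every condition needs an edge."""
    result = set()
    for pair in list(scores) + list(decisions):
        if node in pair:
            result |= pair - {node}
    return result


def linked_component(start):
    """Breadth-first search over derived links; returns all nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in neighbours(current):
            if other not in seen and should_link(current, other):
                seen.add(other)
                queue.append(other)
    seen.discard(start)
    return sorted(seen)


print(linked_component("A"))  # ['B', 'C']
```

Because each node is expanded at most once, the search terminates even when the number of hops is unknown. The `neighbours` helper rescans every edge per call, which is fine for a sketch but would be replaced by a prebuilt adjacency map on a graph with hundreds of thousands of edges.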