Decision Tree query in Gremlin

Question

I have a simplified decision graph. It starts at a begin vertex and ends at a decision vertex. My aim is to sum the scores (each vertex has a score) along the path taken to reach a decision vertex.

The input to the graph is a JSON document.

Each edge carries a variable name and a value (or a min/max range) that is checked against the input JSON.

Example input JSON: { "age": 45, "income_source": "job" }

The expected output is the sum of the scores along the matching path: 10 + 15 + 22 = 47.
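To make the intended computation concrete, here is a minimal Python sketch, outside Gremlin, that mirrors the graph from this question as plain data and walks it for the example input. The names SCORES, EDGES, edge_matches, and walk are hypothetical. The sketch assumes the min/max bounds are stored as integers rather than strings and that the begin vertex contributes a score of 0. It only illustrates the expected result, not how the Gremlin query should be written.

```python
import json

# Vertex scores, keyed by the aliases used in the graph definition.
# Assumption: the begin vertex has no score property, so it counts as 0.
SCORES = {"beg": 0, "dec0": 0, "age10": 10, "age20": 20,
          "sal15": 15, "sal25": 25, "sal18": 18, "sal30": 30,
          "dec22": 22, "dec45": 45, "dec18": 18, "dec30": 30}
DECISIONS = {"dec0", "dec22", "dec45", "dec18", "dec30"}

# (from, to, var, val, min, max); None marks an unused field.
# "buisness" is spelled as in the question's data.
EDGES = [
    ("beg", "dec0", "age", None, 10, 18),
    ("beg", "age10", "age", None, 19, 48),
    ("beg", "age20", "age", None, 49, 80),
    ("age10", "sal15", "income_source", "job", None, None),
    ("age10", "sal25", "income_source", "buisness", None, None),
    ("age20", "sal18", "income_source", "job", None, None),
    ("age20", "sal30", "income_source", "buisness", None, None),
    ("sal15", "dec22", None, None, None, None),
    ("sal25", "dec45", None, None, None, None),
    ("sal18", "dec18", None, None, None, None),
    ("sal30", "dec30", None, None, None, None),
]


def edge_matches(edge, data):
    """Return True if the input satisfies the condition stored on the edge."""
    _, _, var, val, lo, hi = edge
    if var is None:          # unconditional edge
        return True
    if var not in data:      # input does not carry this variable
        return False
    if val is not None:      # exact-value condition
        return data[var] == val
    return lo <= data[var] <= hi  # inclusive range condition


def walk(data, start="beg"):
    """Follow the single matching edge at each level, summing vertex scores."""
    current, total = start, SCORES[start]
    while current not in DECISIONS:
        matches = [e for e in EDGES if e[0] == current and edge_matches(e, data)]
        if len(matches) != 1:
            raise ValueError(f"expected one matching edge from {current}, got {len(matches)}")
        current = matches[0][1]
        total += SCORES[current]
    return total, current


data = json.loads('{ "age":45,"income_source":"job" }')
print(walk(data))  # (47, 'dec22')
```

Because the loop looks each variable up by name, the number of levels and the order of keys in the input do not matter, which is the behaviour wanted from the Gremlin traversal.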

In Neo4j, a Cypher query lets you pass the JSON input as query parameters, but I do not know how this can be done in Gremlin.

Graph link: https://gremlify.com/nwgxqs5h7zh/

g.addV('begin').as('beg').
addV('decision').property('score',0).property('decision_code',"minor").as('dec0').

addV('age').property('score',10).as('age10').
addV('age').property('score',20).as('age20').

addV('salary').property('score',15).as('sal15').
addV('salary').property('score',25).as('sal25').

addV('salary').property('score',18).as('sal18').
addV('salary').property('score',30).as('sal30').

addV('decision').property('score',22).property('decision_code',"decision_22").as('dec22').
addV('decision').property('score',45).property('decision_code',"decision_45").as('dec45').

addV('decision').property('score',18).property('decision_code',"decision_18").as('dec18').
addV('decision').property('score',30).property('decision_code',"decision_30").as('dec30').


addE('relation').property('var',"age").property('val',"").property('min',"10").property('max',"18").from('beg').to('dec0').
addE('relation').property('var',"age").property('val',"").property('min',"19").property('max',"48").from('beg').to('age10').
addE('relation').property('var',"age").property('val',"").property('min',"49").property('max',"80").from('beg').to('age20').


addE('relation').property('var',"income_source").property('val',"job").property('min',"-1").property('max',"-1").from('age10').to('sal15').
addE('relation').property('var',"income_source").property('val',"buisness").property('min',"-1").property('max',"-1").from('age10').to('sal25').

addE('relation').property('var',"income_source").property('val',"job").property('min',"-1").property('max',"-1").from('age20').to('sal18').
addE('relation').property('var',"income_source").property('val',"buisness").property('min',"-1").property('max',"-1").from('age20').to('sal30').

addE('relation').property('var',"").property('val',"").property('min',"-1").property('max',"-1").from('sal15').to('dec22').
addE('relation').property('var',"").property('val',"").property('min',"-1").property('max',"-1").from('sal25').to('dec45').
addE('relation').property('var',"").property('val',"").property('min',"-1").property('max',"-1").from('sal18').to('dec18').
addE('relation').property('var',"").property('val',"").property('min',"-1").property('max',"-1").from('sal30').to('dec30')

There is an issue with the lt, gt, inside, and between predicates: they accept only literal numbers, not a traversal that evaluates to a number.

g.inject(['val1':10,'val2':15]).as('data').V().
where(select('data').select('val1').is(lt(select('data').values('val2'))))

The query above fails with: Cannot compare '10' (Integer) and '[SelectOneStep(last,data), PropertiesStep([val2],value)]'... Because of this, the query below also fails.

g.withSack(0).inject(['age':45,'source':'job']).as('data').
V().hasLabel('begin').
    repeat(outE().as('e').where(select('data').select(select('e').values('var')).is(eq(select('e').values('val')).or(inside(select('e').values('min'),select('e').values('max'))))).inV().sack(sum).by('score')).
    until(hasLabel('decision')).project('final_score','path').by(sack()).by(path())

Please let me know if this problem can be modeled in a different way to achieve the same output score.

Thank you for your time.

TinkerPop throws an error when comparing different data types (string and integer) in the lt, gt, inside, and between predicates. It should evaluate to false in such cases, as Neo4j does. I am not sure how AWS Neptune, JanusGraph, Cosmos DB, or others behave on a data type mismatch. — gremlin, Nov 22 '21 at 12:32
With Apache TinkerPop-enabled graphs it is generally not viewed as good practice to have properties with the same key name but values of different types, such as integer and string. Even returning `false` in such cases is not ideal (in fact it is really incorrect) because, in Groovy for example, `'a' < 1` is `false` but `'a' > 1` is `true`. It would be better to normalize the data in the graph. — Kelvin Lawrence, Nov 22 '21 at 14:22
True, it makes sense for the same key name to always have the same data type. — gremlin, Nov 22 '21 at 16:02

score 0 · Answer 1 · answered Nov 17 '21 at 19:24

I have converted the input JSON into a list. The order of the elements in this list matters: it determines which element the traversal compares at each level.

g.withSack(0).
  // initialize the sack and inject the input list
  inject(["age", 45, "income_source", "job"]).as("input").
  V().hasLabel("begin").
  outE().as('a').local(and(
      // filter on the edge property "var"
      select("input").unfold().range(0, 1).as("temp").
          select("a").values("var").where(eq("temp")),
      // filter by the age from the input
      select("input").unfold().range(1, 2).as("temp").
          select("a").values("max").where(gte("temp")).
          select("a").values("min").where(lte("temp")))).
  inV().sack(sum).by("score").
  outE().as("b").local(and(
      // filter on the edge property "var"
      select("input").unfold().range(2, 3).as("temp").
          select("b").values("var").where(eq("temp")),
      // filter on the edge property "val"
      select("input").unfold().range(3, 4).as("temp").
          select("b").values("val").where(eq("temp")))).
  inV().sack(sum).by("score").
  out().sack(sum).by("score").
  sack()
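The flat, ordered list that this query injects can be derived from the JSON document before the query is submitted. The following Python sketch shows one way to do that. The helper name flatten and the explicit key_order argument are assumptions for illustration, and the key order has to match the order in which the traversal consumes the list.

```python
import json


def flatten(doc, key_order):
    """Turn a JSON object into [key1, value1, key2, value2, ...] in a fixed key order."""
    flat = []
    for key in key_order:
        flat.extend([key, doc[key]])
    return flat


# The order of keys in the JSON text itself does not matter here;
# only key_order decides the layout of the resulting list.
doc = json.loads('{"income_source": "job", "age": 45}')
print(flatten(doc, ["age", "income_source"]))  # ['age', 45, 'income_source', 'job']
```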

This works fine, but there is a problem. When the graph is large (40-50 vertices) and the JSON has 30-40 keys, keeping the input in the right order could be really difficult. Moreover, any change to the graph (such as modifying a path between vertices) requires changes to the query as well. — gremlin, Nov 18 '21 at 02:10

Kelvin Lawrence · Answer 2 · 2021-11-17T20:01:28.797

You can inject a map into a Gremlin query which essentially has the same shape as your JSON document. The basic building blocks for the first part of the query will look something like this, which I tested using your data and TinkerGraph.

gremlin> g.inject(['age':45,'source':'job']).as('data').
......1>   V().hasLabel('begin').
......2>   outE().as('e1').
......3>   where(gte('e1')).
......4>     by(select('data').select('age')).
......5>     by('min').
......6>   where(lte('e1')).
......7>     by(select('data').select('age')).
......8>     by('max').
......9>   valueMap()  

==>[min:19,max:48,var:age]

The next step is to find the edges that have the job tag.

gremlin> g.inject(['age':45,'source':'job']).as('data').
......1>   V().hasLabel('begin').
......2>   outE().as('e1').
......3>   where(gte('e1')).
......4>     by(select('data').select('age')).
......5>     by('min').
......6>   where(lte('e1')).
......7>     by(select('data').select('age')).
......8>     by('max').
......9>   inV().
.....10>   outE().as('e2').
.....11>   where(eq('e2')).
.....12>     by(select('data').select('source')).
.....13>     by('val').valueMap()

==>[val:job,var:income_source]

All we need to do now is traverse to the final node and calculate the sum.

gremlin> g.withSack(0).
......1>   inject(['age':45,'source':'job']).as('data').
......2>   V().hasLabel('begin').
......3>   outE().as('e1').
......4>   where(gte('e1')).
......5>     by(select('data').select('age')).
......6>     by('min').
......7>   where(lte('e1')).
......8>     by(select('data').select('age')).
......9>     by('max').
.....10>   inV().
.....11>   sack(sum).
.....12>     by('score').
.....13>   outE().as('e2').
.....14>   where(eq('e2')).
.....15>     by(select('data').select('source')).
.....16>     by('val').
.....17>   inV().
.....18>   sack(sum).
.....19>     by('score').
.....20>   out().
.....21>   sack(sum).
.....22>     by('score').
.....23>   sack() 

==>47

This runs into an issue in Gremlify (https://gremlify.com/k0auxbd8rim). Moreover, the query to calculate the sum could get really long for a big JSON document (~50 keys). I do not think this is as simple to do in Gremlin as it is in Neo4j, where it would be something like `WITH json_object as object MATCH p = (:beg)-[*]->(final:decision) WHERE ALL (r in relationships(p) WHERE (r.val=object[r.var]) OR (r.min<=object[r.var]<=r.max) OR (object[r.var] in r.val)) RETURN final`. This returns only the last decision node, but it is much simpler and more intuitive. — gremlin, Nov 18 '21 at 03:06
It would be helpful if you could add that information to the original question. You should be able to add a `repeat` step to the above example to traverse the tree to an arbitrary depth. — Kelvin Lawrence, Nov 20 '21 at 14:24
With your query I could build something similar, but I got stuck on a TinkerPop limitation; I have updated the question to describe it. — gremlin, Nov 22 '21 at 12:36
