1

We have a Cosmos graph database like below. ie. A,B,C,... are nodes/vertices and edges are as shown by the arrows.

enter image description here

Each node/vertex represents a value in SQL table. Process and the requirement are as follows.

  1. User modifies node A value in SQL table
  2. Gremlin query passes A into the Graph
  3. Graph returns the following vertices in the below listed order
  4. C# app calculates the values of D,K,M,P nodes in the order and updates SQL table
  • D = A+B+C
  • K = F+E+D
  • M = J+K
  • P = L+M+N+O

I tried the following query and it has over 3000 RUs which is very costly.

g.V("A").emit().repeat(__.in('depends')).until(__.inE().count().is(0))

We need some help to optimise the query. thanks

UPDATE ===========

OK, we can rebuild the graph in a single partition to reduce the RUs but we have a scenario where multiple nodes are affected, highlighted in red in the below picture, on the way up starting from A.

Can someone help with a query to get the results in A, D, K, O, M, P order please? Logic to the query is all the child nodes should be listed before their parents

g.addV('ddn').property('pk', 'pk').property(id, 'A').property('formula', 'A').
addV('ddn').property('pk', 'pk').property(id, 'B').property('formula', 'B').
addV('ddn').property('pk', 'pk').property(id, 'C').property('formula', 'C').
addV('ddn').property('pk', 'pk').property(id, 'D').property('formula', 'A+B+C').property('requires', "'A','B','C'").
addV('ddn').property('pk', 'pk').property(id, 'E').property('formula', 'E').
addV('ddn').property('pk', 'pk').property(id, 'F').property('formula', 'E').
addV('ddn').property('pk', 'pk').property(id, 'G').property('formula', 'H+I').property('requires', "'H','I'").
addV('ddn').property('pk', 'pk').property(id, 'H').property('formula', 'H').
addV('ddn').property('pk', 'pk').property(id, 'I').property('formula', 'I').
addV('ddn').property('pk', 'pk').property(id, 'J').property('formula', 'F+G').property('requires', "'F','G'").
addV('ddn').property('pk', 'pk').property(id, 'K').property('formula', 'D+E+F').property('requires', "'D','E','F'").
addV('ddn').property('pk', 'pk').property(id, 'L').property('formula', 'L').
addV('ddn').property('pk', 'pk').property(id, 'M').property('formula', 'J+K').
addV('ddn').property('pk', 'pk').property(id, 'N').property('formula', 'N').
addV('ddn').property('pk', 'pk').property(id, 'O').property('formula', 'A+K').property('requires', "'A','K'").
addV('ddn').property('pk', 'pk').property(id, 'P').property('formula', 'L+M+N+O').property('requires', "'L','M','N','O'").
V('D').addE('needs').to(V('A')).
V('D').addE('needs').to(V('B')).
V('D').addE('needs').to(V('C')).
V('G').addE('needs').to(V('H')).
V('G').addE('needs').to(V('I')).
V('K').addE('needs').to(V('D')).
V('K').addE('needs').to(V('E')).
V('K').addE('needs').to(V('F')).
V('J').addE('needs').to(V('F')).
V('J').addE('needs').to(V('G')).
V('O').addE('needs').to(V('A')).
V('O').addE('needs').to(V('K')).
V('M').addE('needs').to(V('J')).
V('M').addE('needs').to(V('K')).
V('P').addE('needs').to(V('L')).
V('P').addE('needs').to(V('M')).
V('P').addE('needs').to(V('N')).
V('P').addE('needs').to(V('O'))

enter image description here

stephen mallette
  • 45,298
  • 5
  • 67
  • 135
Kaf
  • 33,101
  • 7
  • 58
  • 78
  • i'm assumming A, D, K,M, P have different PartitionKeys? In that case, not sure there's much you can do since this will inevitably generate a Cross-partition Query in CosmosDB, and those are always expensive – AlexDrenea Jul 15 '20 at 11:28
  • @AlexDrenea A, B, C, D...K,M, P are partition keys. ie. {pk='A', label='node'}, {pk='B', lablel='node'}. – Kaf Jul 15 '20 at 11:35
  • yup, I thought so... Your Gremlin query looks good, I'm not sure there's much to do in order to optimize it, other than revisiting the data model to better suit this query pattern, – AlexDrenea Jul 15 '20 at 13:14
  • @AlexDrenea any suggestions? We are thinking about creating a single hot partition using a common partition key with different labels. ie {pk='pk', label='A'}, {pk='pk', lablel='B'}. What do you think? – Kaf Jul 15 '20 at 14:01
  • hot partitions are going to be really bad for writes though (time wise, not necesarily RUs). There's no harm if giving it a try, but keep in mind the 10GB partition limit. Is there any other logical way of grouping the nodes so that they you hit a bit of a middle ground? If your final operation for (A) ends up cross partititon, you could issue 2 separate calls to the DB, one for each partition and sum them up on the client ? – AlexDrenea Jul 15 '20 at 20:38
  • @AlexDrenea, we can rebuild the graph in a single partition within the 10Gb limit and thinking of a middle ground as well. Can you check the update and help with the ordered query? – Kaf Jul 16 '20 at 10:33
  • What is the logic to the order of "A, D, K, O, M, P"? Also, while the pictures are nice a including a Gremlin script with your sample data would be helpful to giving you a working traversal - here is an example https://stackoverflow.com/questions/51388315/gremlin-choose-one-item-at-random – stephen mallette Jul 16 '20 at 11:13
  • @stephenmallette Logic to the query is all the child nodes should be listed before their parents. I've added the script per example – Kaf Jul 16 '20 at 18:17
  • just so that i am clear - O must come after A, D and K even though A is a direct parent of O because there is a longer path through D and K making A also a great-grand parent of O? and what about the order between O and M from K - could "A,D,K,M,O,P" also have been acceptable? perhaps I'm over-complicating what you are asking for... – stephen mallette Jul 16 '20 at 21:15
  • @stephenmallette spot on with the logic. Both A,D,K,(M,O),P and A,D,K,(O,M),P orders are acceptable because M, O are not inter dependents – Kaf Jul 17 '20 at 08:27

1 Answers1

2

I think the answer boils down to being able to sort the vertices traversed by their path length.

gremlin> g.V("A").
......1>   emit().repeat(__.in('needs')).path().
......2>   group().
......3>     by(tail(local)).
......4>     by(count(local).fold()).
......5>   order(local).
......6>     by(select(values).tail(local)).
......7>   select(keys)
==>[v[A],v[D],v[K],v[M],v[O],v[P]]

I group() by the last element in the path() and transform each path in the group to its length with count(local). That allows me to order() the results by the longest path for each vertex.

Note that I don't think you need until(__.inE().count().is(0)) because you're just traversing to path exhaustion in either case. Also, take care with __.inE().count().is(0) as you end up counting all the edges just to detect a count of zero. Most graphs should optimize that to just until(inE()), but it's always best to be explicit in my opinion. That said, you need to be sure of your data structures when using repeat() - it takes just one edge of bad data to send your traversal into an infinity of traversing. Consider some kind of upper bound to your repeat() that makes sense for your data so that the loop will terminate at some point.

Here is an alternative which actually might be better since it doesn't bother to hold all the counts in the Map after the group():

gremlin> g.V("A").
......1>   emit().repeat(__.in('needs')).path().
......2>   group().
......3>     by(tail(local)).
......4>     by(count(local).order(local).tail(local)).
......5>   order(local).
......6>     by(values).
......7>   select(keys)
==>[v[A],v[D],v[K],v[M],v[O],v[P]]
stephen mallette
  • 45,298
  • 5
  • 67
  • 135
  • Thanks. It gives me the following error on Azure Cosmos. `"Script eval error: \r\n\nActivityId : 7dc096a2-f3db-49a4-b053-f20f582ccb66\nExceptionType : GraphRuntimeException\nExceptionMessage :\r\n\tGremlin Query Execution Error: Cannot compare object of type: CompositeField.\nSource : Microsoft.Azure.Cosmos.Gremlin.Core\n\tGremlinRequestId : 7dc096a2-f3db-49a4-b053-f20f582ccb66\n\tContext : graphcompute\n\tScope : graphcomp-execquery\n\tGraphInterOpStatusCode : GraphRuntimeError\n\tHResult : 0x80131500\r\n"` Works up to `...by(count(local).fold())` not beyond `order()` – Kaf Jul 17 '20 at 13:25
  • 1
    cosmosdb always seems to have an incompatibility somewhere. i've offered an alternative in my answer – stephen mallette Jul 17 '20 at 15:26