Optimize the Gremlin Query running on Cosmos DB

Question

I am creating a PoC using Azure Cosmos DB for Apache Gremlin to model and query/display the Organization Chart. Given a node, the service needs to traverse the graph and pull all children recursively so the UI can display the Organization Chart. For each traversed node, I need to get few attributes e.g. Name, ImageUrl, Direct Manager etc. I was able to put together a query that gives me the desired data but it seems to be very inefficient. It works but as I go up the hierarchy, it starts to timeout. Currently, I don't even have a very large dataset, just about a sample organization with ~500 employees and about 5 levels in handful of paths. The UI component requires the data to be flat i.e. id, name, ..., parentId.

Here is how the model looks like

Here is the query that is being used

g.V('rootUserId')
    .emit()
    .repeat(out('manages'))
    .until(__.not(outE('manages')))
    .path()
    .by(project('orgUser', 'reportsTo')
        .by(valueMap(true))
        .by(out('reportsTo')
            .valueMap(true)
            .dedup()
            .fold()))
    .unfold()
    .dedup()

How can this be optimized? Needless to say, I don't expect this to execute in few seconds. Since this data is slow changing, the queried data would be cached but I need to be able to query it given a root user which could be at level 2 or level 10 (based on UI selection). CosmosDB has a hard limit on execution time (30 seconds as per documentation but I am observing 60 seconds). Eventually, I would like this to scale to a few thousands employees organization. If this model and/or querying method is not the right approach, what is the recommended approach?

Do you really need the `ReportsTo` edges as the `Manages` edges can be traversed in either direction? I'm not an expert on CosmosDB, but from a Gremlin perspective, you could just return the `path` and stitch the results together in the application. Doing an `out` traversal inside a `by` modulator for a `path` is a little unusual. Does CosmosDB support the `tree` step? If it does you can simplify this query a lot as you are essentially looking to return the populated tree for a given hierarchy. — Kelvin Lawrence, Apr 18 '23 at 02:10
Thanks @KelvinLawrence. This query is pieced together by reading several answers, mostly yours :) As you can tell, I am learning this as I go. I agree "reportsTo" isn't really needed. It is fine to do some of the work in application if needed. Returning path results in nested objects which is not easy to handle in the application, so it needs to flatten to a node and a parent node i.e. be consistent across the nodes and not change based on depth of the node. The tree step seems to be supported by CosmosDB. — amit_g, Apr 18 '23 at 15:20

score 0 · Accepted Answer · answered Jun 06 '23 at 19:56

Simplified the data model by storing the "managerId" in each of the employee node thus removing the need to traverse back to manager's node during traversal. With that the query also got reduced to

g.V('{rootUserId}')
    .emit()
    .repeat(out('manages'))
    .until(__.not(outE('manages')))
    .valueMap(true)
    .dedup()

This executes within 4-5 seconds for about ~3000 nodes dataset.

Optimize the Gremlin Query running on Cosmos DB

1 Answers1