Getting user info using JOINs in Cosmos DB

Question

The following is my Company object that I store in Cosmos DB. I have all the essential information about employees in Employees property. I also have a Departments property that both defines departments as well as its members.

{
  "id": "company-a",
  "name": "Company A",
  "employees": [
    {
      "id": "john-smith",
      "name": "John Smith",
      "email": "john.smith@example.com"
    },
    {
      "id": "jane-doe",
      "name": "Jane Doe",
      "email": "jane.doe@example.com"
    },
    {
      "id": "brian-green",
      "name": "Brian Green",
      "email": "brian.green@example.com"
    }
  ],
  "departments": [
    {
      "id": "it",
      "name": "IT Department",
      "members": [
        {
          "id": "john-smith",
          "name": "John Smith",
          "isDepartmentHead": true
        },
        {
          "id": "brian-green",
          "name": "Brian Green",
          "isDepartmentHead": false
        }
      ]
    },
    {
      "id": "hr",
      "name": "HR Department",
      "members": [
        {
          "id": "jane-doe",
          "name": "Jane Doe",
          "isDepartmentHead": true
        }
      ]
    }
  ]
}

I'm trying to return a list of a particular department, including the employee's email which will come from employees property.

Here's what I did but this is including all employees in the output:

SELECT dm.id, dm.name, e.email, em.isDepartmentHead
FROM Companies c
   JOIN d IN c.departments
   JOIN dm IN d.members
   JOIN e IN c.Employees
WHERE c.id = "company-a" AND d.id = "hr"

The correct output would be:

[
   {
      "id": "jane-doe",
      "name": "Jane Doe",
      "email": "jane.doe@example.com",
      "isDepartmentHead": true
   }
]

How do I form my SQL statement to get all members of a department AND include employees' email addresses?

score -1 · Answer 1 · answered Jul 31 '22 at 16:08

I'm pretty sure you cannot write a query like this. You are trying to correlate data twice in the same query across two arrays which I don't think is possible. (at least I've never been successful doing this).

Even if this was possible though, there are other issues with your data model. This data model will not scale. You also need to avoid unbounded or very large arrays within documents (e.g. employees and departments). You also do not want to store unrelated data in the same document. Objective here is to model data for high concurrency operations in the way you use it. This reduces both latency and cost.

There are many ways in which you can remodel this data. But if this is a very small data set, you could probably do something like this below with a partition key of companyId (assuming that you always query within a single company). This will store all employees for one company in the same logical partition which can store up to 20GB of data. I would also model this such that one document stores data specific to the company itself (address, phone number, number of employess, etc) with the id and companyId having the same value. This lets you do things like store materialized aggregates like # of employees and update it in a transaction. Also, since this approach mixes different types of entities (a bonus for NoSQL database, you need a discriminator property that allows you to filter for specific entities within the container so you can deserialize them directly into your model classes.

Here is a data model you could try (please note, you need to determine if this works for you by scaling it up to the amount of data you believe you will need to store. You also need to test and measure the RU/s cost for the CRUD operations you will execute with high concurrency).

Example company document:

{
    "id": "aaaaa",
    "companyId": "aaaaa",
    "companyName": "Company A",
    "type": "company",
    "numberOfEmployees: 3,
    "addresses": [
        {
            "address1": "123 Main Street",
            "address2": "",
            "city": "Los Angeles",
            "state": "California",
            "zip": "92568"
        }
    ]
}

Then an employee document like this:

{
    "id": "jane-doe",
    "companyId": "aaaaa",
    "type": "employee",
    "employeeId": "jane-doe",
    "employeeName": "Jane Doe",
    "employeeEmail": "jane.doe@example.com",
    "departmentId": "hr",
    "departmentName": "HR Department",
    "isDepartmentHead": true
}

Then last, here's the query to get the data you need.

SELECT 
    c.employeeId, 
    c.employeeName, 
    c.employeeEmail, 
    c.IsDepartmentHead
FROM c
WHERE 
    c.companyId = "company-a" AND
    c.type = "employee" AND
    c.departmentId = "hr"

First, thank you for your detailed response. I read your answer a few times to digest it. I do agree with your concern about scalability but I fail to see your point in other areas. For example, in the model you're suggesting, I hardly see any benefit to using NoSQL, other than including company address within Company document and probably storing different document types in the same collection. If that's the model to use, why not opt in for a standard relational database? Maybe I'm not seeing it yet but I don't think I have unrelated data in my document model. — Sam, Jul 31 '22 at 21:35
If by unrelated, you mean departments and employees, I'd agree with you if I were limiting all the data pertaining to employees and departments to what's in this document. My general approach is to utilize multiple database types i.e. document, table, relational, etc. I find that painting a high level picture through data from a document works nicely. As I want to drill down further, I either end up going into other documents or other databases. Yes, maintaining data integrity becomes a challenge but if one doesn't want to worry about that, then we'd have to stick with relational databases. — Sam, Jul 31 '22 at 21:39
You can certainly do this with a relational database. Generally, NoSQL databases are used when you either need unlimited scale, flexible schemas, very low latency with high concurrency or higher availability. It doesn't mean you have to need to those things to use it, but NoSQL databases tend to do these better than a relational database. — Mark Brown, Jul 31 '22 at 21:53
You can maintain relationships using NoSQL. In Cosmos DB you'd use Change Feed to do it. btw, if you're new to Cosmos DB check out the video and links in this answer. Will help you better understand "why" and "how" for this type of database. https://stackoverflow.com/questions/62512550/how-to-query-from-two-containers-in-cosmos-db-sql-api/62515487#62515487 — Mark Brown, Jul 31 '22 at 21:55

Getting user info using JOINs in Cosmos DB

1 Answers1