2

Currently the site is using a relational database (MySQL) however the speed to join all the data is too long and has required caching that has lead to other issues.

The issue is how the two tables would nest into each other creating a circular reference. A simple example would be two tables, one for an ACTOR and a second for a MOVIE. The movie would have the actor and the actor would have a movie. Obviously this is easy in a relational database.

So for example, an ACTOR schema:

ACTOR1
- AGE
- BIO
- MOVIES
    - FILM1 (ties to the FILM1 document)
    - FILM2

Then the MOVIE schema:

FILM1
- RELEASE DATE
- ACTORS
    - ACTOR1 (ties back to the ACTOR document)
    - ACTOR2

Speed is the most important thing to me. I can easily add ID's to an ACTOR document in place of the full MOVIE document. However I'm back to multiple calls. Are there any features in a NoSQL database like MongoDB or DynamoDB that could solve this in a single call? Or is NoSQL just not the right choice?

Ryan Mills
  • 939
  • 1
  • 10
  • 23

1 Answers1

4

While NoSQL generally recommends denormalization of data models, it is best not to have an unbounded list in a single database entry. To model this data in DynamoDB, you should use an adjacency list for modeling the many-to-many relationship. There's no cost-effective way of modeling the data, that I know of, to allow you to get all the data you want in a single call. However, you have said that speed is most important (without giving a latency requirement), so I will try to give you an idea as to how fast you can get the data if stored in DynamoDB.

Your schemas would become something like this:

Actor {
    ActorId, <-- This is the application/database id, not the actor's actual ID
    Name,
    Age,
    Bio
}

Film {
    FilmId, <-- This is the application/database id for the film
    Title,
    Description,
    ReleaseDate
}

ActedIn {
    ActorId,
    FilmId
}

To indicate that an actor acted in a movie, you only need to perform one write (which is consistently single-digit milliseconds using DynamoDB in my experience) to add an ActedIn item to your table.

To get all the movies for an actor, you would need to query once to get all the acted in relationships, and then a batch read to get all the movies. Typical latencies for a query (in my experience) is under 10ms, depending on the network speeds and the amount of data being sent over the network. Since the ActedIn relationship is such a small object, I think you could expect an average case of 5ms for a query, if your query is originating from something that is also running in an AWS datacenter (EC2, Lambda, etc).

Getting a single item is going to be under 5 ms, and you can do that in parallel. There's also a BatchGetItems API, but I don't have any statistics for you on that.

So, is ~10ms fast enough for you?

If not, you can use DAX, which adds a caching layer to DynamoDB and promises request latency of <1ms.

What's the unmaintainable, not-cost-effective way to do this in a single call?

For every ActedIn relationship, store your data like this:

ActedIn {
    ActorId,
    ActorName,
    ActorAge,
    ActorBio,
    FilmId,
    FilmTitle,
    FilmDescription,
    FilmReleaseDate
}

You only need to make one query for any given Actor to get all of their film details, and only one query to get all the Actor details for a given film. Don't actually do this. The duplicated data means that every time you have to update the details for an Actor, you need to update it for every Film they were in, and similarly for Film details. This will be an operational nightmare.

I'm not convinced; it seems like NoSQL is terrible for this.

You should remember that NoSQL comes in many varieties (NoSQL = Not Only SQL), and so even if one NoSQL solution doesn't work for you, you shouldn't rule it out entirely. If you absolutely need this in a single call, you should consider using a Graph database (which is another type of NoSQL database).

Community
  • 1
  • 1
Matthew Pope
  • 7,212
  • 1
  • 28
  • 49