0

My team is writing an application with GAE (Java) that has led me to question the scalability of entity relationship modeling (specifically many-to-many) in object oriented databases like BigTable.

The preferred solution for modeling unowned one-to-many and many-to-many relationships in the App Engine Datastore (see Entity Relationships in JDO) seems to be list-of-keys. However, Google warns:

"There are a few limitations to implementing many-to-many relationships this way. First, you must explicitly retrieve the values on the side of the collection where the list is stored since all you have available are Key objects. Another more important one is that you want to avoid storing overly large lists of keys..."

Speaking of overly large lists of keys, if you attempt to model this way and assume that you are storing one Long for each key then with a per-entity limit of 1MB the theoretical maximum number of relationships per entity is ~130k. For a platform who's primary advantage is scalabililty, that's really not that many relationships. So now we are looking at possibly sharding entities which require more than 130k relationships.

A different approach (Relationship Model) is outlined in the article Modeling Entity Relationships as part of the Mastering the datastore series in the AppEngine developer resources. However, even here Google warns about the performance of relational models:

"However, you need to be very careful because traversing the connections of a collection will require more calls to the datastore. Use this kind of many-to-many relationship only when you really need to, and do so with care to the performance of your application."

So by now you are asking: 'Why do you need more than 130k relationships per-entity?' Well I'm glad you asked. Let's take, for example, a CMS application with say 1 million users (Hey I can dream right?!)

Users can upload content and share it with: 1. public 2. individuals 3. groups 4. any combination

Now someone logs in, and navigates to a dashboard that shows new uploads from people they are connected to in any group. This dashboard should include public content, and content shared specifically with this user or a group this user is a member of. Not too bad right? Let's dig into it.

public class Content {
  private Long id;
  private Long authorId;
  private List<Long> sharedWith; //can be individual ids or group ids
}

Now my query to get everything an id is allowed to see might look like this:

List<Long> idsThatGiveMeAccess = new ArrayList<Long>();
idsThatGiveMeAccess.add(myId);
idsThatGiveMeAccess.add(publicId); //Let's say that sharing with 0L makes it public
for (Group g : groupsImIn)
    idsThatGiveMeAccess.add(g.getId());

List<Long> authorIdsThatIWantToSee = new ArrayList<Long>();
//Add a bunch of authorIds

Query q = new Query("Content")
            .addFilter("authorId", Query.FilterOperator.IN, authorIdsThatIWantToSee)
            .addFilter("sharedWith", Query.FilterOperator.IN, idsThatGiveMeAccess);

Obviously I've already broken several rules. Namely, using two IN filters will blow up. Even a single IN filter at any size approaching the limits we are talking about would blow up. Aside from all that, let's say I want to limit and page through the results... no no! You can't do that if you use an IN filter. I can't think of any way to do this operation in a single query - which means you can't paginate it without extensive read-time processing and managing multiple cursors.

So here are the tools I can think of for doing this: denormalization, sharding, or relationship entities. However even with these concepts I don't see how it is possible to model this data in a way that could scale. Obviously it's possible. Google and others do it all the time. I just can't see how. Can anyone shed any light on how to model this or point me toward any good resources for cms-style access control based on NoSQL DB?

Linus Caldwell
  • 10,908
  • 12
  • 46
  • 58
Jeremiah
  • 861
  • 7
  • 8

1 Answers1

1

storing a list of ids as a property wont scale. Why not simply store a new object for each new relationship? (Like in sql). That object will store for your cms two properties: The id of the shared item and the user id. If its shared with 1000 users you will have 1000 of these. Querying it for a given user is trivial. Listing permissions for a given item or a list of what a user has shared with them is easy too.

Zig Mandel
  • 19,571
  • 5
  • 26
  • 36