9

I'm still learning my lessons about data modeling in bigtable/nosql and would appreciate some feedback. Would it be fair to say that I should avoid parent->child relationships in my data modeling if I frequently need to deal with the children in aggregate across parents?

As an example, let's say I'm building a blog that will be contributed to by a number of authors, and each other has posts, and each post has tags. So I could potentially set up something like this:

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.StringListProperty() 

As I understand this will create an entity group based on the Author parent. Does this cause inefficiency if I mostly need to query for Posts by tags which I expect to cut across multiple Authors?

I understand doing a query on list properties can be inefficient. Let's say each post has about 3 tags on average, but could go all the way up to 7. And I expect my collection of possible tags to be in the low hundreds. Is there any benefit to altering that model to something like this?

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.ListProperty(db.Key)

class Tag(db.Model): 
  name = db.StringProperty() 

Or would I be better off doing something like this?

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts')

class Tag(db.Model): 
  name = db.StringProperty() 

class PostTag(db.Model): 
  post = db.ReferenceProperty(Post, 
    collection_name='posts') 
  tag = db.ReferenceProperty(Tag, 
    collection_name='tags') 

And last question... what if my most common use case will be querying for posts by multiple tags. E.g., "find all posts with tags in {'apples', 'oranges', 'cucumbers', 'bicycles'}" Is one of these approaches more appropriate for a query that looks for posts that have any of a collection of tags?

Thanks, I know that was a mouthful. :-)

Bob Ralian
  • 1,949
  • 1
  • 20
  • 29
  • None of your examples create entity groups. In the first example, you're using a ReferenceProperty, which creates a reference to the other entity - this is mutable, and doesn't imply ownership. Parent references are created by specifying the 'parent' argument to the constructor for the entity - see this page for details: http://code.google.com/appengine/docs/python/datastore/entities.html#Entity_Groups_and_Ancestor_Paths – Nick Johnson Mar 01 '11 at 02:24
  • Ah, thanks Nick. I was missing that part... thought it was the references that created the parent relationship and was missing that you needed to pass the parent to the constructor. That makes sense now. – Bob Ralian Mar 01 '11 at 22:43

2 Answers2

5

Something like the first or second approach are well suited for App Engine. Consider the following setup:

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  author = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.StringListProperty()

class Tag(db.Model): 
  post_count = db.IntegerProperty()

If you use the string tag (case-normalized) as the Tag entity key_name, you can efficiently query for posts with a specific tag, or list the tags of a post, or fetch tag statistics:

post = Post(author=some_author, tags=['app-engine', 'google', 'python'])
post_key = post.put()
# call some method to increment post counts...
increment_tag_post_counts(post_key)

# get posts with a given tag:
matching_posts = Post.all().filter('tags =', 'google').fetch(100)
# or, two tags:
matching_posts = Post.all().filter('tags =', 'google').filter('tags =', 'python').fetch(100)

# get tag list from a post:
tag_stats = Tag.get_by_key_name(post.tags)

The third approach requires additional queries or fetches for most basic operations, and it is more difficult if you want to query for multiple tags.

Robert Kluin
  • 8,282
  • 25
  • 21
  • awesome, thanks robert. this is actually how I have it written. but I'm still new, so I wasn't sure if this was really the best way, so I appreciate you sharing your experience! – Bob Ralian Feb 28 '11 at 20:05
  • 1
    @Bob Ralian, one thing to be wary of are exploding indexes. The general concept is good; you might also find the "Relation Index" pattern useful, but since your list is very small _and_ you want the tags you don't need a separate entity. (http://www.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html) – Robert Kluin Feb 28 '11 at 20:25
2

I would choose the last approach, because it allows for retrieving a list of posts directly given a tag.

The first approach basically makes it impossible to keep a canonical set of tags. In other words, the question "what tags are currently present in the system" is very expensive to answer.

The second approach fixes that problem, but as I mentioned doesn't help you to retrieve posts given a tag.

Entity groups are a bit of a mysterious beast, but suffice it to say the first approach does NOT create an entity group, and that they are only necessary for transactional database operations, and sometimes useful for optimized data reads, but are probably unneeded in a smallish application.

It should be mentioned that any approach you take will only work well in conjunction with a smart caching strategy. GAE apps LOVE caching. Get intimate with the memcache api, and learn the bulk read/write operations on memcache and the datastore.

Kenan Banks
  • 207,056
  • 34
  • 155
  • 173
  • Thanks Triptych. I'm actually not worried about the canonical issue, as I will handle that during validation before saving. Re:entity groups, the docs say "To create an entity in a group, you declare that another entity is the parent of the new entity when you create it." So I take that to mean a parent->child relationship will create an entity group if it is declared on the child at the point it is created. I understand the point of entity groups is for transactions. But do they cause latency/inefficiency for selects across entity groups? Are cross-group transactions possible? – Bob Ralian Feb 28 '11 at 16:18
  • Cross-group transactions are not possible, but if you are doing a lot of selects across entity groups, that's a soft indication that you shouldn't be using them anyway. Also, understand that your validation process will require reading every tag in every Post model in the datstore, if using the first approach. – Kenan Banks Feb 28 '11 at 16:43
  • I'll have a separate Tag model regardless. And I'll keep those in memcache. The first approach just wound't actually tie to them but would instead use them to dictate acceptable strings. It's not necessarily appropriate for Posts where I would select by tag, but would be more appropriate for something like Reader preferences, where I just need to pull the list of tags. – Bob Ralian Feb 28 '11 at 17:03