
Going off an example in the accepted answer here:

For a blogging system, "Posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance."

If this is the case, does that mean that every time my app displays a blog post, I'm loading every single comment ever made on that post? What if there are 3,729 comments? Wouldn't this brutalize the database connection, SQL or NoSQL alike? There's also the obvious scenario where, when I load a blog post, I want to show only the first 10 comments initially.
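For concreteness, here is the kind of post document I understand that answer to be describing - a rough sketch of my own, with made-up field names, in mongo shell syntax:

// Hypothetical post document with comments embedded as an array
// (field names are illustrative, not taken from the linked answer)
db.posts.insert({
    title: "My first post",
    author: "alice@example.com",
    body: "Lorem ipsum...",
    comments: [
        { author: "bob@example.com", text: "Nice post!", created: new Date() },
        { author: "carol@example.com", text: "I disagree.", created: new Date() }
    ]
})

My concern is what happens once that comments array grows into the thousands.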

S. Valmont

2 Answers


Document databases are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.

So in order to answer your question, what we really need to know is how you intend to use the data. Displaying comments associated with a post is a distinctly different scenario than displaying all comments from a particular author. Each one of those requirements will dictate a different design, as will supporting them both.

This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.

Adding more info: There are a few "do's" and "don'ts" when deciding on a strategy:

  1. DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page is a classic example. The first priority is to make sure that these are as efficient as possible. Make sure that your data model either A) allows loading those in a single IO request or B) is cache-friendly.

  2. DON'T: Don't fall into the dreaded "N+1" trap. This pattern occurs when your data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...

  3. DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3,729 comments you obviously aren't going to fetch them all at once. Even if it were feasible from a database perspective, the user experience would be horrible. That's why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20 (see the first sketch after this list). Then each page refresh involves a single DB get.

  4. DO: Balance the read and write requirements. Some types of systems are read-heavy, and you can assume that for each write there will be many reads (StackOverflow is a good example). There it makes sense to make writes more expensive in order to gain benefits in read performance, for example through data denormalization and duplication (see the second sketch after this list). Other systems are evenly balanced or even write-heavy and require other approaches.

  5. DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens up all kinds of interesting optimization possibilities in your data schema.
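To make #3 concrete, here is a rough sketch of the "comments in blocks of 20" idea in mongo shell syntax. This is my own illustration of the approach, not code from the answer; the collection and field names are made up.

// Comments for a post are stored in fixed-size buckets of 20.
// Rendering one page of comments is then a single indexed read -
// no N+1 round trips per comment.
db.commentPages.insert({
    postId: ObjectId("515f1d2a0000000000000001"),   // made-up id
    page: 0,                                        // first block of 20
    count: 2,
    comments: [
        { author: "bob@example.com", text: "Nice post!" },
        { author: "carol@example.com", text: "Me too." }
    ]
})

// Compound index so each page lookup is a single DB get:
db.commentPages.ensureIndex({ postId: 1, page: 1 })
db.commentPages.find({ postId: ObjectId("515f1d2a0000000000000001"), page: 0 })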
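And a minimal sketch of the denormalization mentioned in #4, again my own example: the commenter's display name is copied into the embedded comment at write time, so rendering a post never has to hit a separate users collection.

// The write pays extra: the author's name is duplicated into the comment
// and must be kept in sync if the user ever changes it.
db.posts.update(
    { _id: ObjectId("515f1d2a0000000000000001") },
    { $push: { comments: {
        authorId: ObjectId("515f1d2a0000000000000002"),
        authorName: "Bob Smith",    // denormalized copy
        text: "Nice post!"
    } } }
)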

This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra).

Addys
  • Thanks for the advice. What if we wanted both comments-by-post and comments-by-author? In real-time so that MapReduce isn't good enough. Is this where redundancy, a staple of the NoSQL diet, comes into play? In other words, embed a copy of the comment in its entirety in both post and author? Or is it better to "go relational" and keep a manual reference in one of the two entities (e.g., a combo of post ID and comment ID in author)? – S. Valmont Apr 23 '13 at 23:08
  • Option three: break comments out into their own collection, because they can appear either by post or by author, and aren't rigidly coupled with one or the other. – S. Valmont Apr 23 '13 at 23:32
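For what it's worth, that "option three" might look roughly like the following (my own sketch, not from either poster): a standalone comments collection holding references to both the post and the author, indexed for both access paths.

// Each comment is its own document, referencing the post and the author.
db.comments.insert({
    postId: ObjectId("515f1d2a0000000000000001"),    // made-up ids
    authorId: ObjectId("515f1d2a0000000000000002"),
    text: "Nice post!",
    created: new Date()
})

// One index per access pattern:
db.comments.ensureIndex({ postId: 1, created: -1 })
db.comments.ensureIndex({ authorId: 1, created: -1 })

// Comments by post (latest 10) and comments by author (latest 10):
db.comments.find({ postId: ObjectId("515f1d2a0000000000000001") }).sort({ created: -1 }).limit(10)
db.comments.find({ authorId: ObjectId("515f1d2a0000000000000002") }).sort({ created: -1 }).limit(10)

The trade-off is that displaying a post now requires a second query for its comments instead of a single embedded read.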

Not sure if this answers your question, but in any case you can throttle the number of blog comments in two ways:

  • Load only the last 10 comments, or a range of comments, using the $slice projection operator:

db.blogs.find( {_id : someValue}, { comments: { $slice: -10 } } )

will return the last 10 comments

db.blogs.find( {_id : someValue}, { comments: { $slice: [-20, 10] } } )

will return the previous 10 comments (the next page when paging backwards from the end)

  • Cap the embedded array itself, so that only the last n comments are ever stored (see the sketch below)
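A minimal sketch of that capped-array idea, assuming MongoDB 2.4 or later where $push accepts $each together with $slice (my own example; newComment is a placeholder for the comment document being added):

// Push a new comment and keep only the most recent 100 in the array.
db.blogs.update(
    { _id: someValue },
    { $push: { comments: { $each: [ newComment ], $slice: -100 } } }
)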
Ori Dar
  • Thanks. What if I wanted to display comments by a particular author (i.e., filter on a property, or properties, of the comment array)? Would mongoDB "table scan" each and every comment array in my blog document? – S. Valmont Apr 23 '13 at 20:45
  • You can index an array field (a multikey index), specifically `comments.author`. You can find such documents using [$elemMatch](http://docs.mongodb.org/manual/reference/operator/elemMatch/). – Ori Dar Apr 23 '13 at 22:06
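In shell terms, that suggestion would look roughly like this (my own sketch; the author value is made up):

// Multikey index on the embedded comments' author field:
db.blogs.ensureIndex({ "comments.author": 1 })

// Find blog documents containing at least one comment by that author:
db.blogs.find({ comments: { $elemMatch: { author: "bob@example.com" } } })

Note that this returns the whole matching blog documents; extracting only that author's comments from each array would take an additional projection or aggregation step.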