MongoDB data structure with a large number of internal documents

Question

I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data and I am not sure whether to put the tweets into the user document or to keep them as a separate collection. It seems like putting the tweets inside the user model would quickly hit the document size limit. If that is the case, what is a good way to run MapReduce across a group of users' tweets?

I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.

As I am sure you are all bored of hearing, I am used to RDB land where I would lay out my schema like

| USER |
--------
|ID
|Name
|Etc.

|TWEET__|
---------
|ID
|UserID
|Etc

It seems like the logical schema in Mongo would be

User
|-Tweet (0..3000)
  |-Entities
    |-Hashtags (0..10+)
    |-urls (0..5)
    |-user_mentions (0..12)
  |-GeoData (0..20)
|-somegroupID

But wouldn't that quickly bloat the User document beyond capacity? I would still like to run analysis on tweets belonging to users with similar somegroupID. Conceptually it makes sense to lay out the model as above, but at what point does that become too unwieldy? And what are viable alternatives?

score 1 · Answer 1 · answered Feb 17 '12 at 11:24

You're right that you'll probably run into the 16MB MongoDB document limit here. You are not saying what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
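As a rough back-of-the-envelope check on the size concern, the sketch below estimates how large a user document gets when tweets are embedded. The per-tweet sizes and the base size are assumed figures, not measurements, and `estimateEmbeddedBytes` and `fitsInOneDocument` are hypothetical helpers; only the 16 MB cap comes from MongoDB.

```javascript
// MongoDB's maximum BSON document size: 16 MB (16 * 1024 * 1024 bytes).
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Hypothetical helper: rough size of a user document with embedded tweets.
// avgTweetBytes and baseBytes are assumptions you would measure on real data.
function estimateEmbeddedBytes(tweetCount, avgTweetBytes, baseBytes = 1024) {
  return baseBytes + tweetCount * avgTweetBytes;
}

// True when the estimated document would stay within the 16 MB cap.
function fitsInOneDocument(tweetCount, avgTweetBytes) {
  return estimateEmbeddedBytes(tweetCount, avgTweetBytes) <= MAX_DOC_BYTES;
}

// 3000 tweets at an assumed 2 KB each is roughly 6 MB: under the cap.
console.log(fitsInOneDocument(3000, 2 * 1024));
// The same 3000 tweets at an assumed 6 KB each is roughly 18 MB: over the cap.
console.log(fitsInOneDocument(3000, 6 * 1024));
```

Whether the embedded layout fits therefore depends heavily on how much of each tweet (entities, geodata) is stored, which is one more reason to measure before committing to it.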

Instead of putting your tweets in a user document, you can quite easily do the opposite: add a user ID and a group ID to each tweet document. Then, if you need additional fields from the user, you can always pull them in with a second query at display time.

I mean a tweet document designed like this:

{
    'hashtags': [ '#foo', '#bar' ],
    'urls': [ "http://url1.example.com", 'http://url2.example.com' ],
    'user_mentions' : [ 'queen_uk' ],
    'geodata': { ... },
    'userid': 'derickr',
    'somegroupid' : 40
}

And then for a user collection, the documents could look like:

{
    'userid' : 'derickr',
    'realname' : 'Derick Rethans',
    ...
}
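The "second query" approach can be sketched as follows. This is a minimal in-memory illustration, not the answerer's code: plain arrays stand in for the two collections, the sample data is made up, the collection names `tweets` and `users` are assumptions, and the helper names are hypothetical. The comments show the shell queries each step would correspond to.

```javascript
// In-memory stand-ins for the two collections (sample data is made up).
const tweets = [
  { hashtags: ['#foo'], userid: 'derickr', somegroupid: 40 },
  { hashtags: ['#bar'], userid: 'derickr', somegroupid: 40 },
  { hashtags: ['#baz'], userid: 'someone_else', somegroupid: 7 },
];
const users = [
  { userid: 'derickr', realname: 'Derick Rethans' },
  { userid: 'someone_else', realname: 'Someone Else' },
];

// Step 1: select tweets by group. Shell equivalent, assuming a collection
// named "tweets": db.tweets.find({ somegroupid: 40 })
function tweetsForGroup(groupId) {
  return tweets.filter((t) => t.somegroupid === groupId);
}

// Step 2: fetch only the users those tweets refer to. Shell equivalent,
// assuming a collection named "users":
// db.users.find({ userid: { $in: [ ...ids ] } })
function usersForTweets(selected) {
  const ids = new Set(selected.map((t) => t.userid));
  return users.filter((u) => ids.has(u.userid));
}

const groupTweets = tweetsForGroup(40);
const authors = usersForTweets(groupTweets);
console.log(groupTweets.length, authors.map((u) => u.realname));
```

The analysis itself only touches the tweet collection; the user lookup happens once, for the distinct user IDs in the result, rather than once per tweet.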

answered Feb 17 '12 at 11:24

Derick


I did mention "I would like to run analysis on tweets belonging to users with similar somegroupID". Would it not be breaking the insertion pattern to add a "somegroupID" element to every tweet that needs to be tracked in that group? If I do it that way, then running the mapReduce / analysis on that subset becomes straightforward. – Lloyd Feb 18 '12 at 05:52
I don't understand what you mean by "insertion pattern" and you haven't mentioned what kind of analysis you'd like to do. In any case, you probably would want to avoid M/R if you can and do the analysis with normal queries. – Derick Feb 22 '12 at 09:56
I meant best practices for updating data. If a user is added to another user's "somegroupID", then I would have to insert that somegroupID across all the tweet documents associated with that user. Seems like a high overhead for insert. Then with regards to doing analysis with normal queries, does Mongo do things like count() on a GROUP BY? To get the number of tweets per day or something like that? – Lloyd Feb 22 '12 at 18:04

score 1 · Accepted Answer · answered Feb 23 '12 at 17:39

All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ

Chris Winslett @ MongoHQ

You will find this video interesting:

http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale

Essentially, store one day's worth of tweets for one person in each document. The reasoning:

Querying typically consists of days and users

Therefore, you can have the following index:

{user_id: 1, date: 1} # Date needs to be last because you will range and sort on the date

Have fun!

Chris MongoHQ
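The one-document-per-user-per-day layout described in that message could look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: `bucketByUserAndDay` and `daysForUser` are hypothetical helpers, dates are assumed to be YYYYMMDD integers, and the filtering is done in memory so the example runs without a server.

```javascript
// Hypothetical helper: group raw tweets into one document per user per day.
// The bucket fields user_id and date match the index {user_id: 1, date: 1}.
function bucketByUserAndDay(rawTweets) {
  const buckets = new Map();
  for (const t of rawTweets) {
    const key = `${t.user_id}:${t.tweet_date}`;
    if (!buckets.has(key)) {
      buckets.set(key, { user_id: t.user_id, date: t.tweet_date, tweets: [] });
    }
    buckets.get(key).tweets.push({ tweet_id: t.tweet_id, text: t.text });
  }
  return [...buckets.values()];
}

// Index spec from the message; in the shell it would be passed to
// createIndex on whatever collection holds the buckets.
const indexSpec = { user_id: 1, date: 1 };

// Equality on user_id plus a range and sort on date: the access pattern the
// index is ordered for.
function daysForUser(buckets, userId, fromDate, toDate) {
  return buckets
    .filter((b) => b.user_id === userId && b.date >= fromDate && b.date <= toDate)
    .sort((a, b) => a.date - b.date);
}

const buckets = bucketByUserAndDay([
  { tweet_id: 'a', user_id: 123123, tweet_date: 20120221, text: 'second day' },
  { tweet_id: 'b', user_id: 123123, tweet_date: 20120220, text: 'MongoDB is pretty sweet' },
  { tweet_id: 'c', user_id: 123123, tweet_date: 20120220, text: 'still sweet' },
  { tweet_id: 'd', user_id: 999, tweet_date: 20120220, text: 'another user' },
]);
console.log(daysForUser(buckets, 123123, 20120220, 20120221));
```

Bucketing this way keeps each document bounded by one day's activity, so document size grows with how much a user tweets per day rather than with their whole history.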

I think it makes the most sense to implement the following:

user

{ user_id: 123123,
  screen_name: 'cledwyn',
  misc_bits: {...},
  groups: ['123123_group_tall_people', '123123_group_techies'],
  groups_in: ['123123_group_tall_people']
}

tweet

{ tweet_id: 98798798798987987987987,
  user_id: 123123,
  tweet_date: 20120220,
  text: 'MongoDB is pretty sweet',
  misc_bits: {...},
  groups_in: ['123123_group_tall_people']
}
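With `groups_in` stored on each tweet, the "count() on a GROUP BY" question from the comments (tweets per day for one group) can be expressed as an aggregation pipeline in MongoDB versions that ship the aggregation framework, which arrived after this thread. The sketch below is an assumption-laden illustration: the collection name `tweets` is assumed, both helper functions are hypothetical, and the second one reproduces the pipeline's logic in memory so it can be checked without a server.

```javascript
// Pipeline that could be passed to db.tweets.aggregate(...), assuming a
// collection named "tweets" with the tweet fields shown above.
function tweetsPerDayPipeline(groupName) {
  return [
    { $match: { groups_in: groupName } },                    // tweets tagged with the group
    { $group: { _id: '$tweet_date', count: { $sum: 1 } } },  // GROUP BY day, COUNT(*)
    { $sort: { _id: 1 } },                                   // oldest day first
  ];
}

// In-memory equivalent of the pipeline above.
function tweetsPerDay(tweets, groupName) {
  const counts = new Map();
  for (const t of tweets) {
    if (!(t.groups_in || []).includes(groupName)) continue;
    counts.set(t.tweet_date, (counts.get(t.tweet_date) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([day, count]) => ({ _id: day, count }))
    .sort((a, b) => a._id - b._id);
}

// Made-up sample data using the field names from the tweet document.
const sampleTweets = [
  { tweet_date: 20120221, groups_in: ['123123_group_tall_people'] },
  { tweet_date: 20120220, groups_in: ['123123_group_tall_people'] },
  { tweet_date: 20120220, groups_in: ['123123_group_tall_people', '123123_group_techies'] },
  { tweet_date: 20120220, groups_in: ['123123_group_techies'] },
];
console.log(tweetsPerDay(sampleTweets, '123123_group_tall_people'));
```

The trade-off raised in the comments still applies: when a user joins a group, `groups_in` has to be updated on that user's existing tweets, in exchange for group analysis that needs no join.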
