
We are planning to use MongoDB to store large amounts of analytics data such as views and clicks. I'm unsure how best to structure the documents within MongoDB to aid querying and reduce database size.

We need to record actions against a pagename, client and the type of action. Ideally we need stats that go down to the year/month/day/hour level; we don't need or care about views per second or minute. While this document structure looks OK, I'm aware that 100 visitors would generate 100 new documents.

{ 
  "_id" : ObjectId( "4dabdef81a34961506040000" ),
  "pagename" : "Hello",
  "action" : "view",
  "client" : "client-name",
  "time" : Date( "Mon Apr 18 07:49:28 2011" )
}

Is there a best-practice way of doing this, either using $inc or Capped Collections?

ganchito55
Tom

2 Answers


Updated answer

Hacked together in the mongo shell:

use pagestats;

// a little helper function: builds the document that identifies
// one page in the current UTC hour
var pagePerHour = function(pagename) {
    var d = new Date();
    return {
        page : pagename,
        year: d.getUTCFullYear(),
        month: d.getUTCMonth(), // 0-11
        day : d.getUTCDate(),   // 1-31
        hour: d.getUTCHours()   // 0-23
    };
};

// a pageview happened
db.pagestats.update(
    pagePerHour('Hello'),
    { $inc : { views : 1 }},
    true ); //we want to upsert

// somebody tweeted our page twice!
db.pagestats.update(
    pagePerHour('Hello'),
    { $inc : { tweets : 2 }},
    true ); //we want to upsert

db.pagestats.find();
// { "_id" : ObjectId("4dafe88a02662f38b4a20193"),
//   "year" : 2011, "day" : 21, "hour" : 8, "month" : 3,
//   "page" : "Hello",
//   "tweets" : 2, "views" : 1 }

// 24 hour summary for 'Hello' on 2011-4-21
for (var i = 0; i < 24; i++) {
    // careful: days (1-31), month (0-11) and hours (0-23)
    var stats = db.pagestats.findOne({ page: 'Hello', year: 2011, month: 3, day : 21, hour : i });
    if (stats) {
        // views is missing if only other counters (e.g. tweets) were incremented
        print(i + ': ' + (stats.views || 0) + ' views');
    } else {
        print(i + ': no hits');
    }
}

Depending on which aspects you want to track, you might consider adding more collections (e.g. a collection for user-centric tracking). Hope that helps.
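
If the goal is "views per hour over the past day", one possible variation is to store the hour as a single Date truncated to the hour, so that one range query can replace the 24 findOne calls. This is only a sketch in plain JavaScript: the field name `bucket` and the helpers `hourBucket`, `pagePerHourFilter` and `lastDayFilter` are hypothetical and not part of the schema above.

```javascript
// Truncate a Date to the start of its UTC hour.
function hourBucket(date) {
    return new Date(Date.UTC(
        date.getUTCFullYear(), date.getUTCMonth(),
        date.getUTCDate(), date.getUTCHours()));
}

// Query document identifying one page in one hour (usable for the upsert).
function pagePerHourFilter(pagename, date) {
    return { page: pagename, bucket: hourBucket(date) };
}

// Query document matching the 24 hourly buckets that end with the
// bucket containing `now`.
function lastDayFilter(pagename, now) {
    var end = hourBucket(now);
    var start = new Date(end.getTime() - 23 * 60 * 60 * 1000);
    return { page: pagename, bucket: { $gte: start, $lte: end } };
}

// Intended use in the mongo shell (assumes the hypothetical `bucket` field):
//   db.pagestats.update(pagePerHourFilter('Hello', new Date()),
//                       { $inc : { views : 1 }}, true);
//   db.pagestats.find(lastDayFilter('Hello', new Date())).sort({ bucket: 1 });
```

Hours without any hits simply have no document, so the caller still has to fill those gaps when printing a 24-row summary.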

See also

Blogpost about Analytics Data

Matt
  • Interesting, what would the find() syntax look like if I wanted to display a count of views for 'Hello' for each hour over the past day? – Tom Apr 21 '11 at 06:21
  • .. then this solution would not be exactly ideal. But hang on, I'll post an update. – Matt Apr 21 '11 at 07:07
  • In the meantime you might want to have a look at http://cookbook.mongodb.org/patterns/unique_items_map_reduce/ – Matt Apr 21 '11 at 07:13
  • One last thing before I shut up: MongoDB will give you the speed and flexibility to experiment with different approaches. Don't think too much, hack away, see if it fits your needs and change it if it does not :) – Matt Apr 21 '11 at 08:38
  • Be very careful with querying on compound indexes (that would be required here): "If the first key of the index is present in the query, that index may be selected by the query optimizer. If the first key is not present in the query, the index will only be used if hinted explicitly. While indexes can be used in many cases where an arbitrary subset of indexed fields are present in the query, as a general rule the optimal indexes for a given query are those in which queried fields precede any non queried fields." http://www.mongodb.org/display/DOCS/Indexes#Indexes-CompoundKeysIndexes – Lucas Zamboulis Apr 22 '11 at 14:41
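
To make the prefix rule quoted in the last comment concrete, here is a deliberately simplified model in plain JavaScript. It only counts how many leading index keys a query constrains; it is not how the query optimizer actually decides. The key order in `indexKeys` and the helper `matchedPrefixLength` are assumptions for illustration.

```javascript
// Hypothetical key order for a compound index on the hourly documents.
var indexKeys = ['page', 'year', 'month', 'day', 'hour'];

// Count how many leading index keys the query constrains.
// 0 means the first key is missing, so the index would not be
// selected without an explicit hint (per the quoted documentation).
function matchedPrefixLength(keys, query) {
    var n = 0;
    while (n < keys.length &&
           Object.prototype.hasOwnProperty.call(query, keys[n])) {
        n++;
    }
    return n;
}

var fullKey  = matchedPrefixLength(indexKeys,
    { page: 'Hello', year: 2011, month: 3, day: 21, hour: 8 });
var noPage   = matchedPrefixLength(indexKeys, { year: 2011, month: 3 });
var skipsKey = matchedPrefixLength(indexKeys, { page: 'Hello', day: 21 });
```

Under this model the hourly findOne in the answer constrains all five keys, a query without `page` constrains none of the prefix, and a query on `page` and `day` only benefits from the first key.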

I wouldn't worry too much about space. Mongo scales well in that regard, and adding more storage is reasonably cheap.

One thing to be aware of: if you keep updating a document, its size will grow, which means Mongo will eventually need to find a new place for it on disk. If you have a lot of documents being updated and increasing in size, Mongo will need to copy these documents around a lot, which can slow things down significantly. Of course, this all depends on how much traffic you're expecting.

Based on my experience, go with a simple document format where you don't need to update the documents. It might complicate your querying later on, but you can use map/reduce to get whatever information you want regardless of your document structure (map/reduce is very flexible, and with enough experience you can do almost anything with it).
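
A minimal sketch of that map/reduce idea over documents shaped like the one in the question, written in plain JavaScript so it runs without a database. The names `mapEvent`, `reduceCounts` and `countPerHour` and the sample events are hypothetical; inside MongoDB the map function would read the document from `this` and call `emit(key, value)` instead of returning a pair.

```javascript
// Map step: one key per page, action and UTC hour; the value is a count of 1.
function mapEvent(doc) {
    var t = doc.time;
    var key = [doc.pagename, doc.action, t.getUTCFullYear(),
               t.getUTCMonth() + 1, t.getUTCDate(), t.getUTCHours()].join('|');
    return { key: key, value: 1 };
}

// Reduce step: sum the counts emitted for one key.
function reduceCounts(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}

// Tiny in-memory driver standing in for the database.
function countPerHour(docs) {
    var grouped = {};
    docs.forEach(function (doc) {
        var kv = mapEvent(doc);
        (grouped[kv.key] = grouped[kv.key] || []).push(kv.value);
    });
    var result = {};
    Object.keys(grouped).forEach(function (key) {
        result[key] = reduceCounts(key, grouped[key]);
    });
    return result;
}

// Made-up sample events, one document per action.
var events = [
    { pagename: 'Hello', action: 'view',  client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 7, 49, 28)) },
    { pagename: 'Hello', action: 'view',  client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 7, 55, 2)) },
    { pagename: 'Hello', action: 'click', client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 7, 50, 0)) },
    { pagename: 'Hello', action: 'view',  client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 8, 1, 0)) }
];
var counts = countPerHour(events);
```

Because the raw events are never updated, the same documents can later be regrouped by client or by day simply by changing the key built in the map step.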

skorks