
Scenario:

10,000,000 records/day

Each record holds: visitor, day of visit, cluster (where we see the visitor), and metadata.

What we want to know from this information:

  1. Unique visitors on one or more clusters for a given range of dates.
  2. Unique visitors per day.
  3. Grouping by metadata (platform, browser, etc.) for a given range.

The model I settled on to make this information easy to query is:

    {
        VisitorId: 1,
        ClusterVisit: [
            {clusterId: 1, dates: [date1, date2]},
            {clusterId: 2, dates: [date1, date3]}
        ]
    }
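
For example, unique visitors on one cluster in a date range could be queried roughly like this (a sketch only; the collection name `visits` and the `distinct` approach are assumptions, not part of my current setup):

    // Sketch: distinct visitors that hit cluster 1 between startDate and endDate.
    // "visits" is a placeholder collection name.
    var startDate = ISODate("2012-11-01"), endDate = ISODate("2012-11-30");
    db.visits.distinct("VisitorId", {
        ClusterVisit: {$elemMatch: {
            clusterId: 1,
            dates: {$elemMatch: {$gte: startDate, $lte: endDate}}
        }}
    }).length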

Indexes (see the sketch after this list):

  1. by VisitorId (to ensure uniqueness)
  2. by ClusterVisit.clusterId and ClusterVisit.dates (for searching)
  3. by VisitorId and ClusterVisit.clusterId (for updating)
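
A minimal sketch of how these indexes might be created in the mongo shell (again assuming a collection called `visits`):

    // Sketch: the three indexes listed above, on a placeholder "visits" collection
    db.visits.ensureIndex({VisitorId: 1}, {unique: true})                          // 1. uniqueness
    db.visits.ensureIndex({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1})  // 2. searching
    db.visits.ensureIndex({VisitorId: 1, "ClusterVisit.clusterId": 1})             // 3. updating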

I also split groups of clusters into different collections so that the data can be accessed more efficiently.

Importing: first, we look for an existing VisitorId / clusterId combination and $addToSet the date into it.
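
Roughly, that first step looks like this in the mongo shell, using the positional `$` operator (a sketch; `visits` and `date1` are placeholders):

    // Sketch: if the visitor already has an entry for this cluster,
    // add the date to that cluster's dates array.
    db.visits.update(
        {VisitorId: 1, "ClusterVisit.clusterId": 1},
        {$addToSet: {"ClusterVisit.$.dates": date1}}
    )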

Second: if the first step doesn't match anything, we upsert:

    db.visits.update(               // "visits" is a placeholder collection name
        {VisitorId: 1},
        {$addToSet: {ClusterVisit: {clusterId: 1, dates: [date1]}}},
        {upsert: true}
    )

Between the first and second step, the import covers both the case where the clusterId doesn't exist yet and the case where the VisitorId doesn't exist.

Problems: updates / inserts / upserts become totally inefficient (nearly impossible) as the collection grows, I guess because the documents keep getting bigger each time a new date is added. It is also difficult to maintain (mostly unsetting old dates).

I have a collection with more than 50,000,000 documents that I can't grow any further; it only manages ~100 updates/sec.

I think the model I'm using is not the best for this volume of data. What do you think would work better to get more upserts/sec and query the information FAST, before I move to sharding, which will take more time while I learn it and get confident with it?

I have an x1.large instance on AWS with a 10-disk RAID 10 array.


1 Answer


Arrays are expensive on large collections: mapreduce, aggregate...

Try .explain(): MongoDB 'count()' is very slow. How do we refine/work around with it?
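
For example (a sketch; the collection name and query are placeholders):

    // Sketch: check which index the query uses and how many documents it scans
    db.visits.find({"ClusterVisit.clusterId": 1}).explain()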

Add explicit hints for index: Simple MongoDB query very slow although index is set
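
A minimal sketch of forcing an index (the collection and index shown are assumptions):

    // Sketch: force the query planner to use a specific index
    db.visits.find({VisitorId: 1, "ClusterVisit.clusterId": 1})
             .hint({VisitorId: 1, "ClusterVisit.clusterId": 1})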

A full heap?: Insert performance of node-mongodb-native

The end of memory space for collection: How to improve performance of update() and save() in MongoDB?

Special read clustering: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/

Global write lock?: mongodb bad performance
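
One rough way to check whether the global write lock is the bottleneck (the exact fields vary by MongoDB version):

    // Sketch: global lock statistics, including time spent locked and queued operations
    db.serverStatus().globalLock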

Slow logs performance track: Track MongoDB performance?

Rotate your logs: Does logging output to an output file affect mongoDB performance?

Use profiler: http://www.mongodb.org/display/DOCS/Database+Profiler
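
For example, to log operations slower than 100 ms and inspect the most recent ones (a sketch):

    // Sketch: enable profiling of slow operations, then look at the latest entries
    db.setProfilingLevel(1, 100)
    db.system.profile.find().sort({ts: -1}).limit(5).pretty()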

Move some collection caches to RAM: MongoDB preload documents into RAM for better performance
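
On MongoDB 2.2+ one way to do this is the `touch` command (a sketch; `visits` is a placeholder collection name):

    // Sketch: pull a collection's data and indexes into RAM
    db.runCommand({touch: "visits", data: true, index: true})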

Some ideas about collection allocation size: MongoDB data schema performance

Use separate collections: MongoDB performance with growing data structure

A single query can only use one index (better is a compound one): Why is this mongodb query so slow?

A missing key?: Slow MongoDB query: can you explain why?

Maybe shards: MongoDB's performance on aggregation queries

More Stack Overflow links on improving performance: https://stackoverflow.com/a/7635093/602018

A good resource for further education on sharding and replica sets is: https://education.10gen.com/courses

  • Thank you. I will read all these recommendations and see what I can get out of them. Any ideas about the data model? – Nicolas Alejo Nov 23 '12 at 19:02
  • I think you have run out of free RAM; however, that must be checked. The data should be divided into historical data on disk and current data in RAM. MongoDB is fast as long as the frequently used collections fit in RAM. – 42n4 Nov 23 '12 at 21:25
  • With the current model it is difficult to clean out old data. I'm going back to the old model without documents nested inside documents, keeping in mind that I will have to map/reduce all the possible results I want. – Nicolas Alejo Nov 28 '12 at 13:32