I am starting to work on CouchDB for collecting analytical information from Facebook Insights and other sources. I am not sure about a proper design of a document and would like more experienced CouchDB users to see it and warn me if I am about to make any big mistake.

{
"_id": "0b69a33807d4cb63680dbebc16000af5",
"_rev": "1-7c9916592c377e32cf83acf746a8647c",
//array of metrics, one element per Facebook page, around 10 pages per document
"metrics": [        
    {
        "sourceId": "210627525692699", //facebook page ID
        "source": "facebook",
        "values": {
           "page_likes": 53
           //many more other metrics, around 100
       }
   },
   {
       "sourceId": "354413697924499", //facebook page ID
       "source": "facebook",
       "values": {
           "page_wall_posts_source_unique": {"other": 0, "composer": 1},
           "page_likes": 12
           //many more other metrics, around 100
       }
   }
],
"timestamp": [
   2012,
   10,
   15,
   10,
   0,
   0
],
"customerId": "71ff942f-9283-4916-ab84-4927bce09117"
}

Expected number of documents: +10 000 every hour, +240 000 every day.

Expected requests to the documents:

  • sum of values per customer, per sourceId, per metric in a given time period
  • specialized views for more complex metrics
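
For illustration, the first request could be served by a map function along these lines. This is only a minimal sketch, not a tested production view: the key layout `[customerId, sourceId, metric, year, month, day, hour]` is an assumption, and the `emit` collector and `sampleDoc` exist only so the function can be run outside CouchDB.

```javascript
// Stand-in for CouchDB's emit(); inside a real view the server provides it.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function sketch: one row per numeric metric.
// Key layout (an assumption): [customerId, sourceId, metric, year, month, day, hour]
function map(doc) {
  if (!doc.metrics || !doc.timestamp) return;
  doc.metrics.forEach(function (m) {
    Object.keys(m.values).forEach(function (name) {
      var v = m.values[name];
      // Object-valued metrics such as page_wall_posts_source_unique are
      // skipped here; they would need a specialized view.
      if (typeof v === "number") {
        emit([doc.customerId, m.sourceId, name].concat(doc.timestamp.slice(0, 4)), v);
      }
    });
  });
}

// Trimmed copy of the document above, used to run the map function locally.
const sampleDoc = {
  customerId: "71ff942f-9283-4916-ab84-4927bce09117",
  timestamp: [2012, 10, 15, 10, 0, 0],
  metrics: [
    { sourceId: "210627525692699", source: "facebook", values: { page_likes: 53 } },
    {
      sourceId: "354413697924499",
      source: "facebook",
      values: { page_wall_posts_source_unique: { other: 0, composer: 1 }, page_likes: 12 }
    }
  ]
};

map(sampleDoc);

// Local equivalent of applying the built-in _sum reduce to every emitted row.
const total = rows.reduce(function (acc, r) { return acc + r.value; }, 0);
console.log(rows, total);
```

Paired with the built-in `_sum` reduce, a `startkey`/`endkey` range over a fixed `[customerId, sourceId, metric]` prefix would return the sum for that time period.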

Questions:

  • To get analytics for some complex metrics (like page_wall_posts_source_unique) we will need to build specialized views, probably many of them. Should I expect problems with view update time?
  • Is it the right decision to use an array for the timestamp, or is it better to use a long?
  • Should I use one design document, or put every view in its own?

This is an excellent question. I am not expert enough to offer a definitive answer but I will offer an opinion. The use of an array for timestamp is fine but you will find it easier to use a long. (Querying with arrays works but formatting the URL correctly is a little painful.) Because views are refreshed when their design document is updated you might want to keep the views in separate design docs. – lambmj Dec 06 '12 at 18:23
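
To make the URL-formatting point concrete: array keys have to be JSON-encoded and then URL-encoded before they go into `startkey` and `endkey`. The sketch below shows that, plus one way to turn the array into a long. The host, the database name `insights`, the design document `metrics`, and the view `by_time` are all hypothetical, and the conversion assumes the array is in UTC.

```javascript
// Build a view query URL whose startkey/endkey are array keys.
// Each key is JSON-encoded first and then percent-encoded.
function viewRangeUrl(base, startKey, endKey) {
  const params = [
    "startkey=" + encodeURIComponent(JSON.stringify(startKey)),
    "endkey=" + encodeURIComponent(JSON.stringify(endKey))
  ];
  return base + "?" + params.join("&");
}

// Hypothetical host, database, design document, and view names.
const base = "http://localhost:5984/insights/_design/metrics/_view/by_time";

// All rows for 15 Oct 2012, assuming a view keyed on [year, month, day, hour].
const url = viewRangeUrl(base, [2012, 10, 15, 0], [2012, 10, 15, 23]);
console.log(url);

// Converting the array form to a long (milliseconds since the epoch).
// Assumes the array is in UTC; Date.UTC expects a zero-based month.
const ts = [2012, 10, 15, 10, 0, 0];
const asLong = Date.UTC(ts[0], ts[1] - 1, ts[2], ts[3], ts[4], ts[5]);
console.log(asLong, new Date(asLong).toISOString());
```

With a long, the keys are plain numbers and only need ordinary URL encoding, but array keys are what make rolling up by year, month, day, or hour possible through `group_level`.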

2 Answers

I think you would be better off not using CouchDB for this purpose. As I see it, one of your main goals is aggregating across your data, and that is not the main thing CouchDB was designed for.

In fact, CouchDB's querying and aggregation features are quite poor (this is what I found from real experience, having used it in 3 projects). Of course, you could add Lucene for full-text search, which would extend its query features, but it would probably still be less than you need. CouchDB is ideally suited to Wikipedia-like projects, because each time you update a document it creates a new revision and you still have the old version. That is one of its main features, and looking at your project I do not see that you want to use it.

Also, CouchDB is not meant for millions of small documents. It prefers working with a moderate number of medium-sized documents. Millions of small documents are not a good fit for CouchDB's view system.

So I advise you to settle on your main goals and take a look at the other NoSQL solutions. In the NoSQL world there is no single solution for all goals; instead, each solution targets particular goals, unlike SQL, where you use one database for everything. At first glance, I think MongoDB should fit your goals.

But anyway, to answer your questions:

1) I think so, yes, but it depends on how many docs will be recalculated.

2) I prefer a long value: a single value is easy to query, whereas an array of different values will give you problems. Using longs as timestamps is also common practice.

3) It is no big deal. You can do it either way.

Ph0en1x

Thank you for the responses, guys.

Ph0en1x, I partially agree with you: CouchDB was not an obvious choice, but I am even less sure of the other options, so for now I will stick with CouchDB.

Anyway here are the answers I collected from multiple sources:

1) Obviously, it depends on the number of documents, but with small documents the probability of problems grows.

2) Both approaches would work; a timestamp is a little more universal.

3) The more views in one design document, the higher the probability that they are reindexed more often. So I am trying to keep the number of views per design document to a minimum.
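
As a sketch of point 3: CouchDB indexes all views of a design document together, so changing one view in a design document causes its sibling views to be rebuilt as well. One way to limit that is to generate one design document per view. The view names and map bodies below are hypothetical placeholders, not the views I actually use.

```javascript
// Hypothetical views; the map bodies are placeholders for illustration only.
const views = {
  sum_by_metric: {
    map: "function (doc) { emit(doc.customerId, 1); }",
    reduce: "_sum"
  },
  wall_posts_by_source: {
    map: "function (doc) { emit(doc.customerId, 1); }",
    reduce: "_sum"
  }
};

// Wrap each view in its own design document, so that editing one view
// only invalidates the index of that single design document.
function toDesignDocs(viewDefs) {
  return Object.keys(viewDefs).map(function (name) {
    const ddoc = { _id: "_design/" + name, language: "javascript", views: {} };
    ddoc.views[name] = viewDefs[name];
    return ddoc;
  });
}

const designDocs = toDesignDocs(views);
console.log(JSON.stringify(designDocs, null, 2));
```

The trade-off is that each design document keeps its own index, so views that always change together may just as well share one.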