
I have a crawler that tests web sites/pages. Below is the model I would use with an RDBMS:

class Site {
   public Uri Uri { get; set; }
   public Collection<Test> Tests { get; set; }
}

class Test {
   public Collection<Page> Pages { get; set; }
}

class Page {
   // Page info (status code, whether it loaded, etc.)
}

My queries would be things like: how many pages failed to load, how many returned a 404, and so on, per site and overall.
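To make that concrete, here is roughly the kind of aggregation I mean, written as LINQ over the classes above (Page.StatusCode and Page.Loaded are placeholders for whatever page info I end up storing, and sites is an IEnumerable<Site>):

// assumes using System.Linq;
var perSite = sites.Select(site => new
{
    site.Uri,
    NotFound = site.Tests.SelectMany(t => t.Pages).Count(p => p.StatusCode == 404),
    FailedToLoad = site.Tests.SelectMany(t => t.Pages).Count(p => !p.Loaded)
});

var overallNotFound = sites.SelectMany(s => s.Tests)
                           .SelectMany(t => t.Pages)
                           .Count(p => p.StatusCode == 404);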

My concern with Couchbase is the 20 MB document size limit. Some of the sites I crawl have 10K pages, and if I crawl a site several times, say 10, the Site document will eventually exceed that limit.

What is the correct way to model this?

DarthVader

1 Answer


There is no single correct way to model this without a lot more information. I can think of a couple of approaches off the top of my head that may or may not work for you, but here is my first one.

  1. Each site could be its own document per crawl, even at 10K pages. Keep a counter document for each site and use its value as the version number that forms part of the object ID of each crawl document, so the object ID might look something like "siteID::version". Then when you need the latest version, you just read the value of the counter and do a get on the object ID it points to. Very easy and VERY fast. A minimal sketch of this pattern follows below.
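Here is a minimal sketch of that pattern, assuming the Couchbase .NET SDK 3.x. The "site::<id>::<version>" key format, the helper names, and the way the collection is passed in are my own illustrative choices, not anything Couchbase prescribes:

using System.Threading.Tasks;
using Couchbase.KeyValue;

public static class SiteCrawlStore
{
   // Per-site counter document, e.g. "site::example.com::counter".
   static string CounterKey(string siteId) => $"site::{siteId}::counter";

   // One versioned crawl document, e.g. "site::example.com::3".
   static string CrawlKey(string siteId, ulong version) => $"site::{siteId}::{version}";

   // Bump the site's counter and store this crawl under the new version number.
   public static async Task SaveCrawlAsync(ICouchbaseCollection collection, string siteId, object crawlDoc)
   {
      var counter = await collection.Binary.IncrementAsync(CounterKey(siteId)); // creates the counter on first use
      await collection.UpsertAsync(CrawlKey(siteId, counter.Content), crawlDoc);
   }

   // Read the counter to find the latest version, then fetch that crawl directly by key.
   public static async Task<T> GetLatestCrawlAsync<T>(ICouchbaseCollection collection, string siteId)
   {
      var latest = await collection.GetAsync(CounterKey(siteId));
      var crawl = await collection.GetAsync(CrawlKey(siteId, latest.ContentAs<ulong>()));
      return crawl.ContentAs<T>();
   }
}

Retrieving the latest crawl is then just two key-value gets (the counter, then the document), with no query involved, and older crawls stay addressable by their version number.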

For more information on advanced object modeling with keys in Couchbase, might I recommend a couple of blog posts on the topic that could give you some good ideas. Those examples are not exactly your use case, but they should get you thinking about how you might model your data and why you might make certain decisions that take specific advantage of Couchbase's capabilities. Like I said, there really is no single correct way; it depends on your data, your use case, and the performance you need to get out of Couchbase. Some approaches are simply better than others.

NoSQLKnowHow