Is Couchbase capable of storing multiple lists, each holding between 100,000 and 100,000,000 records?

The records are stored in a "data series" fashion (or delayed queue) and queried accordingly.


Example

List dataset structure:

  • id
  • list_id # the list the record belongs to
  • next_check timestamp
  • status
  • some other fields..

Typical use case:

Select all records whose next_check is in the past and that have a specific status, one page at a time:

SELECT * FROM records
WHERE next_check < NOW()
  AND status = X
LIMIT <page size> OFFSET <offset>

Then I can perform several actions:

  • Update the record with a new next_check/status values.
  • OR delete the record and insert a new one.

Questions

What I'm trying to understand is this:

  1. Can Couchbase handle such a huge dataset?
  2. What is the best way to store and query such a structure?
  3. Finally, are there any Couchbase limitations I need to pay attention to (e.g. don't use more than 1000 buckets)?

Thanks!

eldad87
  • How many lists are you likely to have? 10? 100? 5? Also, how selective is the query likely to be? Is it going to cover most of the records? Or just a small fraction of the records? – EbenH Sep 21 '17 at 21:29
  • I'm going to have 5-20 lists per customer; the maximum number of records in each list depends on Couchbase's limitations. It doesn't mean that I need a bucket for every list... The query is going to cover all records; it's like a 'durable queue' which stores future events. – eldad87 Sep 26 '17 at 08:51

1 Answer

To answer your questions, I will need to describe a few things about how Couchbase works.

  1. Couchbase stores JSON documents, which can contain arrays of objects, nested arrays, or primitive values. You could have a document for each customer, with one or more arrays holding the lists related to that customer. The maximum size for a document is 20 MB, though documents are usually much smaller. Still, it sounds like 20 MB should be much bigger than you need for the lists associated with a customer. Alternatively, you might want to store the list elements as documents in their own right. Do you have any reason to have separate lists for each customer? Data modeling in Couchbase is just as important as it is in relational databases, but the process is somewhat different. There are several good blog posts on the topic, which you can find with your favorite search engine.
  2. Each document is stored as the value in a key-value store. The fastest way to retrieve a document is via its key. Slower, but still pretty fast, is to have an index on whatever field you are querying on, such as next_check. Couchbase does support indexes on fields inside arrays. As with relational databases, the slowest way to access documents is a sequential scan of all records, and you don't want to do that if you can avoid it.
  3. Couchbase buckets are collections of documents, each with a unique key; in other words, a bucket is a keyspace. A Couchbase cluster is limited to 10 buckets, so you certainly can't have 1000 of them. Buckets are therefore more analogous to the concept of "databases" in MySQL or Oracle. Since Couchbase does not enforce schemas, there is currently no equivalent in Couchbase to the concept of "tables" in relational DBs.
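
To make points 1 and 2 a bit more concrete, here is a minimal sketch of the "one document per record" option together with a matching index and query. Everything named here is an assumption for illustration only: the bucket name records, the index name idx_due, the key scheme rec::<list_id>::<id>, the parameter names, and storing next_check as epoch seconds. The statements are held as plain strings rather than sent through any particular SDK, so adapt them to whichever client you use.

```python
import time

# Hypothetical N1QL statements; bucket, index and parameter names are made up.
# A composite index on (status, next_check) matches the question's filter.
CREATE_INDEX = "CREATE INDEX idx_due ON `records`(status, next_check)"

DUE_QUERY = (
    "SELECT META(r).id AS doc_key, r.* "
    "FROM `records` AS r "
    "WHERE r.status = $status AND r.next_check < $now_ts "
    "ORDER BY r.next_check "
    "LIMIT $max_rows"
)


def make_key(list_id, record_id):
    """Build a document key (assumed scheme) for direct key-value lookups."""
    return f"rec::{list_id}::{record_id}"


def make_record(list_id, record_id, next_check, status, **extra):
    """Return (key, document) for a single list record stored as its own document."""
    doc = {"list_id": list_id, "next_check": next_check, "status": status}
    doc.update(extra)  # "some other fields" from the question
    return make_key(list_id, record_id), doc


def due_query_params(status, max_rows=100, now_ts=None):
    """Named parameters for DUE_QUERY; now_ts defaults to the current epoch second."""
    if now_ts is None:
        now_ts = int(time.time())
    return {"status": status, "now_ts": now_ts, "max_rows": max_rows}


if __name__ == "__main__":
    key, doc = make_record("list-7", 42, 1506000000, "pending", payload="x")
    print(key, doc)
    print(due_query_params("pending", now_ts=1506000100))
```

After processing a record you would either update next_check/status on the same key or delete the document and insert a new one, both of which are key-value operations and so avoid the query path entirely.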

Couchbase can certainly support buckets with tens or hundreds of millions of documents; I have one with 38 million 1 KB documents on my laptop. Efficient querying, however, requires defining indexes to match the queries you run, having enough memory to hold your indexes and working set of documents, and possibly scaling the cluster across multiple nodes (which Couchbase makes really easy).
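
For a rough feel of the memory side, a back-of-the-envelope calculation of raw document size can help. This is only a sketch: it assumes an average document size of 1 KB (as in my laptop example) and ignores per-document metadata, replicas, and index storage, so treat the result as a lower bound rather than a sizing recommendation.

```python
def raw_data_gib(doc_count, avg_doc_bytes=1024):
    """Raw JSON payload size in GiB, ignoring metadata, replicas and indexes."""
    return doc_count * avg_doc_bytes / 2**30


if __name__ == "__main__":
    # Record counts taken from the question's range and the example above.
    for count in (100_000, 38_000_000, 100_000_000):
        print(f"{count:>11,} docs -> {raw_data_gib(count):8.2f} GiB")
```

Under these assumptions, 38 million documents come to roughly 36 GiB and 100 million to roughly 95 GiB of raw data, which is why the working set (the records that are due soon) matters more than the total count.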

EbenH