5

The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket. I'm wondering if that count is just a rough estimate and not an exact count of the number of items in the bucket. Here's the behavior I'm seeing that leads me to this reasoning:

  • When I use XDCR to replicate a bucket to a backup node, the count in the backup bucket after the XDCR has finished will be significantly higher than the count of documents in the source bucket, sometimes by tens of thousands (in a bucket that contains hundreds of millions of documents).
  • When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).

How can I get an accurate count of the exact number of items in a bucket, so that I can be sure, after my DCP or XDCR processes have completed, that all documents have made it to the new location?

Murphy Randle
  • 816
  • 10
  • 19
  • 1
    A count of items in couchbase is somewhat meaningless. As couchbase is *eventually-consistent*, using things like count to do anything other than make order of magnitude comparisons is not a useful thing to do. The management console will give you number of items in each bucket. – theMayer Jul 20 '18 at 16:49
  • Very interesting, @theMayer. Great point about the eventual consistency. If I were to stop writes to the cluster, I should be able to get an accurate count at some point, correct? In this case I'm trying to migrate away from couchbase to another database, and I want a way to verify that all documents have safely transferred before flipping the switch. – Murphy Randle Jul 20 '18 at 16:52
  • 1
    In theory, but I don't think a simple count really provides much assurance that "all documents have safely transferred." If your application can't handle a little bit of data loss (most can), then your only fallback is to write some sort of hashing/checksum function that you can run on both sides for every document. – theMayer Jul 20 '18 at 16:54
  • I'm not totally sure what goes on in XDCR nowadays, but in prior versions, that's actually how the XDCR mechanism operated under the hood. – theMayer Jul 20 '18 at 16:55
  • Very interesting. Thanks for the answer! Yes, the type of data loss is what's important to know, I guess. Ephemeral things like password reset tokens are no big deal to lose, but user data must all be transferred safely. I'll have to do more research on that. – Murphy Randle Jul 20 '18 at 16:57
  • And yes, as I understand it, XDCR was / is built on DCP. – Murphy Randle Jul 20 '18 at 16:58

4 Answers4

2

There can be a number of different reasons why the count could be different without more details it would be hard to say. The common cases are:

The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket.

The Admin console is accurate but does not auto updated, so a refresh is required.

When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).

DCP will include tombstones (deleted documents) and possibly multiple mutations for the same document. Which could explain why the DCP count is out.

With regards to using N1QL, if the query is a simple SELECT COUNT(*) FROM bucketName then depending on the Couchbase Server version it will use the bucket stats directly.

In other words as mentioned previously the bucket stats via the REST interface or by asking the Data service directly will be accurate.

Paddy
  • 1,195
  • 9
  • 16
1

The most accurate answer would be to go directly to the bucket info something like

curl http://hostname:8091/pools/default/buckets/beer-sample/ -u user:password | jq '.basicStats | {itemCount: .itemCount }'

the result would be immediate, no need for indexing:

{
  "itemCount": 7303
}

or not in Json format

curl http://centos:8091/pools/default/buckets/beer-sample/ -u roi:password | jq '.basicStats.itemCount'
Roi Katz
  • 575
  • 4
  • 14
1

Alright, here I am to answer my own question over a year later :). We did a lot of experimentation today when trying to migrate items out of a bucket containing roughly 2.6 million items into an SQL database. We wanted to make sure the row count matched between Couchbase and the new database before going live.

Unfortunately when we tried the normal select count(*) from <bucket>; the document count we received was over what we expected by just 1, so we broke down the query and did a count over all documents in the bucket while grouping by an attribute, hoping to find what kind of document was missing in the target DB. The total for the counts for each group should have added up to the same total that we got from the count query. Unfortunately, they did not. The total added up to 1 fewer than we expected (so that's off by two from the original count query).

We found the category of document that was off by 1, expecting to have an extra doc in Couchbase that didn't make it to the target DB, but found instead that the totals indicated the reverse, that the target DB had one extra doc. This all seemed very fishy, so we did a query to pull all of the IDs in that group out into a single JSON file, and we counted them. Alas, the actual count of documents in that group matched up with the target DB, meaning that Couchbase's counting was incorrect in both cases.

I'm not sure what implementation details caused this to happen, but it seems like at least the over-counting might have been a caching issue. I was able to finally get a correct document count by using a query like this:

select count(*) from <bucket> where meta(<bucket>).id;

This query ran for much longer than the original count did, indicating that whatever cache is used for counts was being skipped, and it did come up with the correct number.

We were doing these tests on a relatively small number of documents, half a million or so. With the full volume of the bucket, counts had been off by as much as 15 in the past, apparently becoming less accurate as the document count increased.

We just did a re-sync of the full bucket. The bucket total as reported by the dashboard and by the original N1ql query are over the expected count by 7. We ran the modified query, waited for the result, and got the expected count.

In case you're wondering, we did turn off traffic to the bucket, so document counts were not likely to be fluctuating during this process, except when a document reached its expiry date in Couchbase, and was automatically deleted.

Murphy Randle
  • 816
  • 10
  • 19
  • select count(*) from ; uses bucket stats as described here https://blog.couchbase.com/wp-content/uploads/2017/10/N1QL-A-Practical-Guide-2nd-Edition.pdf Page 168, Use Case 1 – vsr Oct 15 '19 at 15:38
0

To get an accurate count, you can run a N1QL query. That will get you as accurate a number as Couchbase is capable of producing.

SELECT COUNT(*) FROM bucketName

Use REQUEST_PLUS consistency to make sure the indexes have received the very latest updates.

https://developer.couchbase.com/documentation/server/current/indexes/performance-consistency.html

You'll need a query node for this, though.

Johan Larson
  • 1,880
  • 1
  • 11
  • 14
  • Thanks for the answer! Unfortunately, I don't have a query node for N1QL available. But I'm just now thinking, a reduce function should be able to return an accurate count if it just adds 1 to the total for each record in a bucket, right? – Murphy Randle Jul 20 '18 at 16:21
  • bucket stats should have count Check this out https://developer.couchbase.com/documentation/server/5.1/rest-api/rest-bucket-info.html – vsr Jul 20 '18 at 20:30