The holy grail of CouchDB is its replication feature. With TouchDB, Cloudant-Sync and Couchbase-Lite you can even replicate a database to and from users' smartphones, so the data remains available even when there are connectivity problems.

The CouchDB replication protocol (which may be implemented slightly differently across frameworks and SDKs) makes a GET request for every document that has changed.

Both Cloudant and Iris-Couch offer pricing plans based on the size of the database, the number of light HTTP requests (GET, HEAD) and the number of heavy HTTP requests (PUT, POST, DELETE). This means that a GET for a single document costs the same as a GET to /_all_docs.

In some sense, the replication protocol looks very inefficient under these pricing plans. For example, if your users only pull documents from the server, it may be cheaper to call /_all_docs?include_docs=true than to run a standard replication, even if the /_all_docs request makes you download documents that did not change...
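To make the comparison concrete, here is a rough sketch of the request counts under a deliberately simplified model: one GET per changed document plus a fixed overhead for a pull replication, versus a single request for /_all_docs?include_docs=true. The function names and the default overhead of one request are assumptions for illustration; a real replicator may issue additional requests (for example for checkpoints) that are not counted here.

```python
def pull_replication_requests(changed_docs: int, overhead_requests: int = 1) -> int:
    """Billable requests for one pull replication in the simplified model:
    one GET per changed document plus a fixed overhead (assumed, e.g. the
    request that reads the list of changes)."""
    if changed_docs < 0 or overhead_requests < 0:
        raise ValueError("counts must be non-negative")
    return changed_docs + overhead_requests


def all_docs_requests() -> int:
    """A single GET to /_all_docs?include_docs=true, however many documents it returns."""
    return 1


def cheaper_by_request_count(changed_docs: int, overhead_requests: int = 1) -> str:
    """Name the strategy with fewer billable requests (ties go to replication)."""
    replication = pull_replication_requests(changed_docs, overhead_requests)
    return "replication" if replication <= all_docs_requests() else "_all_docs"


if __name__ == "__main__":
    # First sync of a 100-document database under the assumptions above.
    print(pull_replication_requests(100), "requests vs", all_docs_requests())
    print("cheaper:", cheaper_by_request_count(100))
```

Under per-request pricing this model favours /_all_docs as soon as anything has changed, regardless of how much unchanged data is downloaded along the way.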

Am I missing something? Shouldn't the pricing plans consider the amount of data being downloaded/uploaded instead of the number of requests? Shouldn't a GET request for a single document be much cheaper than calling /_all_docs or a view? Could the replication protocol be tweaked so that it is less efficient in terms of bandwidth but much cheaper?

P.S. I know that Couchbase is a separate project and that the CouchDB replication protocol is irrelevant to it. Couchbase also supports replication to and from clients (via Couchbase Lite). Is there any way to compare the two mechanisms in terms of the number of requests to the server?

--- EDIT ---

It looks like /_all_docs is being used in the Couchbase-Lite replication algorithm, not to reduce the cost but to optimize the process: https://github.com/couchbase/couchbase-lite-ios/wiki/Replication-Algorithm

  • A limited case of the above-mentioned bulk-get optimization is possible with the standard API: revisions of generation 1 (revision ID starts with “1-”) can be fetched in bulk via _all_docs, because by definition they have no revision histories. Unfortunately _all_docs can’t include attachment bodies, so if it returns a document whose JSON indicates it has attachments, those will have to be fetched separately. Nonetheless, this optimization can help significantly, and is currently implemented in Couchbase Lite.
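The optimization quoted above can be sketched as a partition of the changed revisions: those of generation 1 are candidates for one bulk fetch via _all_docs, and the rest are fetched individually. This is only an illustration of the idea, not Couchbase Lite's code; the `Change` type and function names are hypothetical, and a real client would still need to check that the revision returned by _all_docs is the one it asked for.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of one changed document: (document ID, revision ID).
Change = Tuple[str, str]


def generation(rev_id: str) -> int:
    """Revision IDs have the form '<generation>-<hash>'; return the generation."""
    return int(rev_id.split("-", 1)[0])


def partition_changes(changes: Iterable[Change]) -> Tuple[List[str], List[str]]:
    """Split document IDs into (bulk-fetchable via _all_docs, fetch individually)."""
    bulk: List[str] = []
    individual: List[str] = []
    for doc_id, rev_id in changes:
        # Generation-1 revisions have no revision history, so a bulk fetch suffices.
        if generation(rev_id) == 1:
            bulk.append(doc_id)
        else:
            individual.append(doc_id)
    return bulk, individual


if __name__ == "__main__":
    changes = [("a", "1-abc"), ("b", "3-def"), ("c", "1-123"), ("d", "10-fff")]
    bulk, individual = partition_changes(changes)
    print("one bulk request for:", bulk)
    print("one GET each for:", individual)
```

Parsing the generation as an integer, rather than testing for a leading "1", keeps revisions such as "10-fff" out of the bulk group.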

--- EDIT ---

This issue is being handled in Couchbase Sync Gateway, not as a part of CouchDB: https://github.com/couchbase/sync_gateway/wiki/Bulk-GET

I wonder if this is ever going to be implemented in CouchDB. It looks like the service providers that charge per request have no incentive to support this feature...

Oren
  • I'm voting to close this question as off-topic because pricing questions are not about programming, and thus off-topic. – Jonathan Hall Dec 08 '19 at 10:01

1 Answer

You have a point, but then again it does not matter.

Why you have a point

Indeed, a single /_all_docs request returns all of your documents while counting as only one request. You just found a way to cheat your host into giving you a 'free service'.

Why it does not matter

  • Replication needs to be efficient, so you really don't want the slave couch to check every document that may have been updated against _all_docs on the master. Even if you wanted to do that, you would replicate often enough to retain reasonable consistency, so each replication would likely see only a small number of changes: if 1 in 1,000 documents gets updated between two replications, the overhead cost of replicating document by document is pretty small.

  • Assume you run a blog/application that queries _all_docs to minimize the number of requests. Well done: if your application is meant to be responsive and you need 5 kByte of documents from a 50 MByte database, you just lost a whole lot of users, because you'll be as unresponsive as anything.

  • You optimize at the wrong end. You will typically hit a $20 limit at around 1 million GET requests. If you have a website with that level of traffic and you run ads on it, you'll likely make well in excess of $500 (assuming an eCPM of $0.5). You are much more likely to increase your revenue by adding content than by squeezing the cost of your CouchDB.
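The arithmetic behind the last point, as a small sketch. It takes the figures above at face value (about $20 per 1 million GET requests, an eCPM of $0.5) and additionally assumes, purely for illustration, that each GET request corresponds to one ad impression; the function names are made up for this example.

```python
def request_cost_usd(get_requests: int, usd_per_million: float = 20.0) -> float:
    """Hosting cost of the GET requests, assuming a flat price per million requests."""
    return get_requests / 1_000_000 * usd_per_million


def ad_revenue_usd(impressions: int, ecpm_usd: float = 0.5) -> float:
    """Ad revenue, where eCPM is the revenue per 1,000 impressions."""
    return impressions / 1_000 * ecpm_usd


if __name__ == "__main__":
    requests = 1_000_000  # assumed: one ad impression per GET request
    print("hosting cost:", request_cost_usd(requests))
    print("ad revenue:  ", ad_revenue_usd(requests))
```

With these inputs the cost side is $20 and the revenue side is $500, which is where the comparison in the bullet comes from.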

Hans
  • If you have 10K users with your smartphone app pulling from a DB with 1K documents, you will easily reach 10M GET requests. At least the first replication should use _all_docs, because the user downloads all the documents anyway. – Oren Nov 29 '14 at 10:45
  • I still would not use _all_docs, but would maybe write a dedicated view that emits only the relevant documents. If I were one of your users on a mobile network (3G or less) with only 500 MByte of monthly data, I would not be pleased to learn that you send me 10 MByte, at least not on a regular basis. If you code something like a high-score list, I'd recommend writing a dedicated view and querying it with the startkey and endkey parameters, so you have two GET requests: say, one for the top x players and one for the players just above and below the current score. – Hans Nov 29 '14 at 11:00
  • Even if you have only 100 documents of 1 kB each, it would be 100 times cheaper (!) to use _all_docs instead of a standard first replication – Oren Nov 29 '14 at 11:19
  • I appreciate this. Yet again, spending time on optimizing the user experience (dedicated small views mean small downloads, and limiting the keys avoids pulling a full set of results) will likely be more productive than pulling _all_docs. If your product pulls these 100 kByte 50 times, you have just spent 5 MByte. You should ask yourself whether you can do this with one view and two GET requests with different startkey/endkey settings, returning only the data you require. – Hans Dec 01 '14 at 19:40
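A minimal sketch of the "one view, two GET requests" idea from the comments, for the high-score example. It only builds the two view URLs and sends nothing. The server address, database name (`scores`), design document (`game`) and view (`by_score`) are hypothetical, and the view is assumed to emit the score as its key; `startkey`, `endkey`, `descending` and `limit` are standard CouchDB view query parameters whose values are JSON-encoded.

```python
import json
from urllib.parse import urlencode


def view_url(base_url: str, db: str, design_doc: str, view: str, **params) -> str:
    """Build a CouchDB view query URL; parameter values are JSON-encoded."""
    path = f"{base_url}/{db}/_design/{design_doc}/_view/{view}"
    query = urlencode({key: json.dumps(value) for key, value in params.items()})
    return f"{path}?{query}" if query else path


if __name__ == "__main__":
    base = "http://localhost:5984"  # assumed local CouchDB address

    # Request 1: the top 10 players (highest scores first).
    top_players = view_url(base, "scores", "game", "by_score",
                           descending=True, limit=10)

    # Request 2: players with scores around the current player's 1000 points.
    nearby_players = view_url(base, "scores", "game", "by_score",
                              startkey=900, endkey=1100)

    print(top_players)
    print(nearby_players)
```

Each URL counts as one billable GET, and each response contains only the rows in the requested range rather than the whole database.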