There are 500,000 documents stored in a CouchDB database. A client app needs to retrieve all of them for processing into another system. Is there a recommended way to retrieve them all? I understand there is paging support using the "limit" and "skip" parameters. It looks like one call can get the total document count, and then a loop can call CouchDB repeatedly, updating the "limit" and "skip" values each time. Is there an alternative way to retrieve all documents?

obautista

1 Answer

Aside from replication, I think not. Of course, it really depends on specifics not given in the OP. 500k documents of 200 bytes each may not be a bandwidth issue, but 500k documents of 100 KB each might be a consideration.
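
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the sizes above mean 200 bytes and 100 kilobytes (decimal) per document, and it ignores JSON and HTTP overhead; `totalBytes` is just an illustrative helper.

```javascript
// Rough payload estimate; real transfers add JSON and HTTP overhead.
const docCount = 500000;
const totalBytes = (bytesPerDoc) => docCount * bytesPerDoc;

console.log(totalBytes(200) / 1e6, "MB");    // small docs: 100 MB
console.log(totalBytes(100000) / 1e9, "GB"); // large docs: 50 GB
```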

There are a lot of ways to approach this, and since many details are not given, the most anyone can do is offer a generic approach, which I will do here.

The essence is to use /{db}/_all_docs with a combination of start_key, limit and skip.
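
For anyone calling the HTTP API directly rather than through a client library, a request path might be built roughly like this. `allDocsPath` is a hypothetical helper, not part of CouchDB or nano, and the database name and key are made up. The one detail worth noting is that key parameters are sent JSON-encoded, so string keys keep their quotes.

```javascript
// Hypothetical helper: builds the path and query string for one page of _all_docs.
function allDocsPath(dbName, { start_key = null, limit, skip = 0 }) {
    const query = [
        // Keys are JSON-encoded: "doc-0042" is sent with its quotes, null as null.
        `start_key=${encodeURIComponent(JSON.stringify(start_key))}`,
        `limit=${limit}`,
        `skip=${skip}`,
    ].join("&");
    return `/${encodeURIComponent(dbName)}/_all_docs?${query}`;
}

console.log(allDocsPath("mydb", { start_key: "doc-0042", limit: 100, skip: 1 }));
// /mydb/_all_docs?start_key=%22doc-0042%22&limit=100&skip=1
```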

The initial state should be

  • start_key = null: null sorts first in the CouchDB view collation
  • limit = ?: arbitrary, as it depends on average document size, bandwidth, processing power, etc.
  • skip = 0: one doesn't want to skip anything at the start

The general solution is to adjust start_key and limit according to the last response.

Do note that skip can be very inefficient for large values, because CouchDB still has to walk past every skipped row. In this solution skip is only ever 0 or 1, which is quite OK.

Each successive state depends upon the prior response:

  • start_key = the last row's doc key: one can't know what the next key is, right?
  • skip = 1: so the response doesn't repeat the last doc of the previous response

In other words, a subsequent request is saying "Give me the next set of docs starting one past the last document key received".
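
The state transition can be sketched without any database at all. Everything below is illustrative: `nextParams`, `collectIds` and `fakeAllDocs` are made-up names, and `fakeAllDocs` is a synchronous in-memory stand-in for the real, asynchronous request.

```javascript
// Derive the next request's parameters from the previous response (if any).
function nextParams(previous, limit) {
    if (!previous) {
        return { start_key: null, limit, skip: 0 }; // initial state
    }
    const lastRow = previous.rows[previous.rows.length - 1];
    return { start_key: lastRow.key, limit, skip: 1 }; // start one past the last key
}

// Keep requesting pages until a page comes back short.
function collectIds(fetchPage, limit) {
    const seen = [];
    let response;
    do {
        response = fetchPage(nextParams(response, limit));
        seen.push(...response.rows.map((row) => row.id));
    } while (response.rows.length === limit);
    return seen;
}

// In-memory stand-in for _all_docs over five sorted ids, for illustration only.
const ids = ["a", "b", "c", "d", "e"];
function fakeAllDocs({ start_key, limit, skip }) {
    const found = start_key === null ? 0 : ids.findIndex((id) => id >= start_key);
    const from = (found === -1 ? ids.length : found) + skip;
    return { rows: ids.slice(from, from + limit).map((id) => ({ id, key: id })) };
}

console.log(collectIds(fakeAllDocs, 2)); // [ 'a', 'b', 'c', 'd', 'e' ]
```

The loop stops on the first short page, so when the total happens to be an exact multiple of limit it costs one extra request that returns no rows.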

Here's a nano-based script that provides a skeleton upon which to throw meat. It is naïve: it puts credentials in the URL and, for clarity, has no error handling.

const nano = require("nano")("http://{uid}:{pwd}@{address}:{port}");
const db = nano.db.use("{your db name}");
const echo = (json) => console.log(JSON.stringify(json, undefined, 2));
const processRows = (rows) => {  
    echo(rows);
};
(async () => {
    let start_key = null;
    let limit = 2; // whatever
    let skip = 0;
    let response;
    let more = false;
    
    do {
        if (response) {
            // next query is based on the last query.
            start_key = response.rows.pop().key;
            skip = 1;
        }
        response = await db.list({ start_key, limit, skip });
        processRows(response.rows);
        more = response.rows.length === limit;
    } while (more);
    console.info("Processing completed.");
})();

A final word: this will return design documents (ids starting with _design/) too, so you will probably want to filter those out.
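
One possible way to do that filtering, as a sketch: `isDesignDoc` and `withoutDesignDocs` are hypothetical helper names, and the check relies on design document ids starting with `_design/`.

```javascript
// Design documents have ids that start with "_design/".
const isDesignDoc = (row) => row.id.startsWith("_design/");

// Returns a new array; the original rows are left untouched for paging.
const withoutDesignDocs = (rows) => rows.filter((row) => !isDesignDoc(row));

const sampleRows = [
    { id: "_design/reports", key: "_design/reports" },
    { id: "doc-1", key: "doc-1" },
    { id: "doc-2", key: "doc-2" },
];
console.log(withoutDesignDocs(sampleRows).map((row) => row.id)); // [ 'doc-1', 'doc-2' ]
```

Filter only what gets processed. The next start_key and the "is there more" check should still be taken from the unfiltered rows, otherwise a page that contains a design document would look short and end the loop early.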

Update
I neglected to add the actual answer: the default is to return all rows, as stated in section 1.5.4.4 of the CouchDB documentation, "Using Limits and Skipping Rows", so how much to fetch per request is up to the caller.

RamblinRose