CouchDB replication ignoring sporadic documents

Question

I've got a CouchDB setup (CouchDB 2.1.1) for my app, which relies heavily on replication integrity. We are using the "one db per user" approach, with an additional layer of "role" DBs that group users, as in the image below.

Recently, while increasing the number of beta testers, we discovered that some documents had not been replicated as they should have been. We are unable to see any pattern in document size, creation/update time, user, or anything else. The errors seem to happen sporadically, with 2-3 successfully replicated docs followed by 4-6 non-replicated docs.

The server responds with {"error":"not_found","reason":"missing"} on those docs.

Most (but not all) of the user documents have been replicated to the corresponding Role DB, but very few made it all the way to the Master DB. This never happened when testing with < 100 documents (now we're at 1000-1200 docs in the db).

I discovered a problem with the "max open files" setting mentioned in the Performance chapter in the docs and fixed it, but the non-replicated documents are still not replicating. If I open a document and save it, it will replicate.

This is my current theory:

1. The replication process tried to copy new documents when the user went online.
2. The write process failed because the Linux "max_open_files" limit had been reached.
3. The master DB still thinks the replication was successful.
4. At a later replication, the master DB ignores those old documents and only tries to replicate new ones.

Could this be correct? And can I somehow make the CouchDB server "double check" all documents and the integrity of previous replications?

Thank you for your time and any helpful comments!

score 3 · Accepted Answer · answered Feb 08 '19 at 16:15

I have experienced something similar in the past: when attempting to replicate documents without sufficient permissions, the replication fails, as it should. But once the permissions issue is fixed, the documents you attempted to replicate still cannot be replicated, although an edit/save on the documents fixes the issue. I wonder if this is due to checkpoints? The CouchDB manual says this about the "use_checkpoints" flag:

Disabling checkpoints is not recommended as CouchDB will scan the Source database’s changes feed from the beginning.

Scanning from the beginning sounds like it might fix the problem, though, so perhaps disabling checkpoints could help. I never got back to that issue at the time, so I am afraid this is not a proper answer, just a suggestion.
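As a rough illustration of the suggestion, here is a minimal Python sketch that builds a one-off replication request with checkpoints disabled. The helper names `build_replication_request` and `start_replication`, and the database URLs in the usage note, are made up for this example; only the `use_checkpoints` flag and the `/_replicate` endpoint come from CouchDB itself. Authentication is left out, so a secured server would need credentials added.

```python
import json
import urllib.request


def build_replication_request(source, target, use_checkpoints=False):
    """Return the JSON body for a one-off replication (hypothetical helper)."""
    return {
        "source": source,
        "target": target,
        # With checkpoints disabled, the replicator rescans the source's
        # changes feed from the beginning instead of resuming.
        "use_checkpoints": use_checkpoints,
    }


def start_replication(server_url, body):
    """POST the body to <server_url>/_replicate and return the parsed reply.

    Assumes the server accepts unauthenticated requests; adjust as needed.
    """
    request = urllib.request.Request(
        server_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look something like `start_replication("http://localhost:5984", build_replication_request("http://localhost:5984/userdb-example", "http://localhost:5984/role-example"))`, with the placeholder database names swapped for real ones. Since the manual advises against running without checkpoints, this is probably best treated as a temporary repair step rather than a permanent setting.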

Thanks, sounds feasible! I'll try it and get back with an update! — davidanton1d, Feb 08 '19 at 17:27
It turned out that most of my missing documents were actually deleted docs where the deletion itself hadn't been replicated due to filters, but temporarily disabling checkpoints seemed to reduce the number of missing docs. Conclusion: If someone in the future finds this thread after suffering from partly corrupted documents, try this! — davidanton1d, Feb 11 '19 at 15:19
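To tell the two cases in the last comment apart (documents that never arrived versus deletions that never propagated), one option is to compare the source's `_changes` feed with the target's `_all_docs` listing. The sketch below assumes both responses have already been fetched and parsed as JSON; `classify_differences` is a hypothetical helper, not part of CouchDB.

```python
def classify_differences(source_changes, target_all_docs):
    """Compare a source `_changes` response with a target `_all_docs` response.

    Both arguments are the parsed JSON bodies returned by CouchDB.
    """
    deleted_on_source = set()
    live_on_source = set()
    for change in source_changes["results"]:
        # Deleted documents carry "deleted": true in the changes feed.
        if change.get("deleted"):
            deleted_on_source.add(change["id"])
        else:
            live_on_source.add(change["id"])

    # `_all_docs` only lists documents that are not deleted.
    on_target = {row["id"] for row in target_all_docs["rows"]}

    return {
        # Live on the source but absent on the target.
        "missing_on_target": sorted(live_on_source - on_target),
        # Deleted on the source but still present on the target.
        "deletion_not_replicated": sorted(deleted_on_source & on_target),
    }
```

Anything listed under `deletion_not_replicated` points at a filter that blocks deleted documents, while `missing_on_target` lists the documents a rescan without checkpoints might pick up.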
