
I've built a CouchDB cluster of 4 nodes to store the tweets I've retrieved.

The cluster was configured to use 8 shards and keep 3 copies of each document:

[cluster]
q=8
r=2
w=2
n=3
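
With q=8 and n=3 that should come to 24 shard files in total, i.e. 6 per node, each covering an eighth of the hash range. The shard map can be checked on the node-local port (a sketch; 5986 is the default node-local port and the database name "twitter" is only a placeholder):

# the "by_range" field should list 8 ranges, each held by 3 of the 4 nodes
$ curl -s http://localhost:5986/_dbs/twitter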

I haven't added any views or additional indexes, and the size of the database shown in Fauxton is 4.3 GB.
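
For reference, the figure Fauxton shows should match the clustered database info, which can also be fetched directly (the port 5984 and the database name "twitter" are assumptions):

# the "sizes" object reports file (on disk), external (uncompressed data)
# and active (live data) sizes
$ curl -s http://localhost:5984/twitter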

However, CouchDB is taking up an exceptionally large amount of disk space on one of the nodes:

$ ansible -i hosts -s -m shell -a 'du /vol/couchdb/shards/* -sh' couchdb
crake.couchdb.cloud | SUCCESS | rc=0 >>
363M    /vol/couchdb/shards/00000000-1fffffff
990M    /vol/couchdb/shards/20000000-3fffffff
17G     /vol/couchdb/shards/40000000-5fffffff
1.4G    /vol/couchdb/shards/60000000-7fffffff
359M    /vol/couchdb/shards/80000000-9fffffff
989M    /vol/couchdb/shards/a0000000-bfffffff
12G     /vol/couchdb/shards/c0000000-dfffffff
1.6G    /vol/couchdb/shards/e0000000-ffffffff

darter.couchdb.cloud | SUCCESS | rc=0 >>
1.4G    /vol/couchdb/shards/00000000-1fffffff
367M    /vol/couchdb/shards/20000000-3fffffff
1001M   /vol/couchdb/shards/40000000-5fffffff
1.4G    /vol/couchdb/shards/60000000-7fffffff
1.4G    /vol/couchdb/shards/80000000-9fffffff
364M    /vol/couchdb/shards/a0000000-bfffffff
998M    /vol/couchdb/shards/c0000000-dfffffff
1.4G    /vol/couchdb/shards/e0000000-ffffffff

bustard.couchdb.cloud | SUCCESS | rc=0 >>
1004M   /vol/couchdb/shards/00000000-1fffffff
1.4G    /vol/couchdb/shards/20000000-3fffffff
1.4G    /vol/couchdb/shards/40000000-5fffffff
365M    /vol/couchdb/shards/60000000-7fffffff
1001M   /vol/couchdb/shards/80000000-9fffffff
1.4G    /vol/couchdb/shards/a0000000-bfffffff
1.4G    /vol/couchdb/shards/c0000000-dfffffff
364M    /vol/couchdb/shards/e0000000-ffffffff

avocet.couchdb.cloud | SUCCESS | rc=0 >>
1.4G    /vol/couchdb/shards/00000000-1fffffff
1.4G    /vol/couchdb/shards/20000000-3fffffff
368M    /vol/couchdb/shards/40000000-5fffffff
999M    /vol/couchdb/shards/60000000-7fffffff
1.4G    /vol/couchdb/shards/80000000-9fffffff
1.4G    /vol/couchdb/shards/a0000000-bfffffff
364M    /vol/couchdb/shards/c0000000-dfffffff
1001M   /vol/couchdb/shards/e0000000-ffffffff

On crake.couchdb.cloud, two of the shards, 40000000-5fffffff and c0000000-dfffffff, are far larger than the others.

I once tried deleting those large shards on crake.couchdb.cloud and waiting for CouchDB to rebuild them itself. The disk usage was balanced after the rebuild, but it gradually became unbalanced again once I started adding new documents to the database.
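
Roughly speaking, the deletion looked like this (a sketch; the systemd unit name and the database name "twitter" are placeholders):

$ sudo systemctl stop couchdb
$ sudo rm /vol/couchdb/shards/40000000-5fffffff/twitter.*.couch
$ sudo rm /vol/couchdb/shards/c0000000-dfffffff/twitter.*.couch
$ sudo systemctl start couchdb
# internal replication then recreated the missing shard copies from the other nodes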

I'm using MD5(tweet[id_str]) as the document ID. Could this be the cause of the issue?
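
For illustration, the IDs look like this (the tweet ID below is made up, just to show the shape):

# document ID = MD5 hex digest of the tweet's id_str
$ echo -n '850007368138018817' | md5sum
# gives a 32-character hex string, which I'd expect to spread documents
# evenly across the 00000000-ffffffff shard ranges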

I'm really confused by this, because even if I had made a mistake somewhere, I'd expect it to eat up the resources of 3 different nodes, since the data are replicated across the cluster.

Please help, thanks.

UPDATE

Later I deleted all the VPS instances and rebuilt the cluster with 3 CouchDB nodes, namely Avocet, Bustard and Crake. The new cluster configuration is as follows:

[cluster]
q=12
r=2
w=2
n=2
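
With q=12 and n=2 that should be 24 shard files spread over the 3 nodes, i.e. 8 per node. For completeness, cluster membership can be checked like this (assuming the default clustered port 5984):

# all three nodes should appear in both "all_nodes" and "cluster_nodes"
$ curl -s http://avocet.couchdb.cloud:5984/_membership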

Before rebuilding, I replicated all the data to a separate CouchDB instance so I could transfer it back afterwards. The disk usage was balanced after the restore.
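
The transfer back was plain replication, something along these lines (the URLs, credentials and database name are placeholders):

$ curl -X POST http://admin:password@avocet.couchdb.cloud:5984/_replicate \
       -H 'Content-Type: application/json' \
       -d '{"source": "http://backup.example.org:5984/twitter",
            "target": "http://admin:password@avocet.couchdb.cloud:5984/twitter",
            "create_target": true}'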

Additionally, I introduced HAProxy on the 4th node, Darter, as a load balancer.

This time, all my Twitter retrievers send their requests to the load balancer. However, the disk usage became unbalanced again, and once more it was the 3rd node, Crake, that took up far more space:

bustard.couchdb.cloud | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc         81G  9.4G   68G  13% /vol

avocet.couchdb.cloud | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc         81G  9.3G   68G  13% /vol

crake.couchdb.cloud | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc         81G   30G   48G  39% /vol

The database size is only 4.2 GB, yet Crake is using approximately 7 times that!

I'm completely clueless now...
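
If more detail helps, I can drill down into the individual shard files on Crake with something like this (same paths as above; leftover *.compact* files, if any, would point at interrupted compactions):

# largest individual files under the shards directory
$ du -ah /vol/couchdb/shards/ | sort -h | tail -n 20
$ ls -lh /vol/couchdb/shards/*/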

UPDATE 2

The _dbs info from all the nodes:

crake.couchdb.cloud | SUCCESS | rc=0 >>
{
    "db_name": "_dbs",
    "update_seq": "11-g2wAAAABaANkABtjb3VjaGRiQGNyYWtlLmNvdWNoZGIuY2xvdWRsAAAAAmEAbgQA_____2phC2o",
    "sizes": {
        "file": 131281,
        "external": 8313,
        "active": 9975
    },
    "purge_seq": 0,
    "other": {
        "data_size": 8313
    },
    "doc_del_count": 0,
    "doc_count": 7,
    "disk_size": 131281,
    "disk_format_version": 6,
    "data_size": 9975,
    "compact_running": false,
    "instance_start_time": "0"
}

avocet.couchdb.cloud | SUCCESS | rc=0 >>
{
    "db_name": "_dbs",
    "update_seq": "15-g2wAAAABaANkABxjb3VjaGRiQGF2b2NldC5jb3VjaGRiLmNsb3VkbAAAAAJhAG4EAP____9qYQ9q",
    "sizes": {
        "file": 159954,
        "external": 8313,
        "active": 10444
    },
    "purge_seq": 0,
    "other": {
        "data_size": 8313
    },
    "doc_del_count": 0,
    "doc_count": 7,
    "disk_size": 159954,
    "disk_format_version": 6,
    "data_size": 10444,
    "compact_running": false,
    "instance_start_time": "0"
}

bustard.couchdb.cloud | SUCCESS | rc=0 >>
{
    "db_name": "_dbs",
    "update_seq": "15-g2wAAAABaANkAB1jb3VjaGRiQGJ1c3RhcmQuY291Y2hkYi5jbG91ZGwAAAACYQBuBAD_____amEPag",
    "sizes": {
        "file": 159955,
        "external": 8313,
        "active": 9999
    },
    "purge_seq": 0,
    "other": {
        "data_size": 8313
    },
    "doc_del_count": 0,
    "doc_count": 7,
    "disk_size": 159955,
    "disk_format_version": 6,
    "data_size": 9999,
    "compact_running": false,
    "instance_start_time": "0"
}
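
(For reference, this node-local info can be fetched with something like the following; 5986 is the default node-local port in CouchDB 2.x.)

$ curl -s http://localhost:5986/_dbs
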
  • It sounds very strange to me too, I think you need to look at which databases have the large shards, what their _dbs entries say and whether their shards are small on the other nodes or nonexistent. – lossleader Apr 27 '17 at 09:43
  • @lossleader I just deleted the whole cluster and rebuilt a new one with 3 nodes (2 doc replicas). Additionally, I introduced HAProxy to the system as a balancer. Again, after transferring all data back from an alternative CouchDB instance, all ok. Started retrieving new tweets, unbalanced. – Frederick Zhang Apr 28 '17 at 05:53
  • @lossleader And it seems that it's always the 3rd node which takes a lot more storage. Currently the DB size is `4.2 GB` however the 3rd node has already used up `30 GB`, which is approx 7 times larger! I feel really lost now – Frederick Zhang Apr 28 '17 at 05:53
  • @lossleader I've just pasted the `_dbs` from all nodes but frankly I do not understand some of the metrics in it. The node `Crake` that's eating up storage has the same `data_size` as others however the `disk_size` is rather greater. – Frederick Zhang Apr 28 '17 at 07:04
