
I have a BigCouch cluster with Q=256, N=3, R=2, W=2. Everything seems to be up and running, and I can read and write small test documents. The application is in Python and uses the CouchDB library. The cluster has 3 nodes, each running CentOS on VMware with 3 cores and 6GB RAM. Versions: BigCouch 0.4.0, CouchDB 1.1.1, Erlang R14B04; CentOS Linux release 6.0 (Final) on EC2 and CentOS release 6.2 (Final) on VMware 5.0.
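
For reference, Q and N are fixed per database at creation time (R and W apply per request). Here is a minimal sketch of creating a database with those settings; the host name is a placeholder, and it assumes the BigCouch build honors the q and n query parameters on database creation:

import requests  # any HTTP client works; requests is used here for illustration

BASE = "http://yourhost:5984"  # placeholder host

# Create a database with 256 shards (Q) and 3 replicas (N),
# matching the cluster configuration described above.
resp = requests.put(BASE + "/test", params={"q": 256, "n": 3})
print(resp.status_code, resp.text)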

On startup the application attempts a bulk insert of 412 documents totaling about 490KB of data. This works fine with N=1, so the document contents aren't the issue. But with N=3 I seem to randomly get one of these results (a minimal sketch of the insert call follows the list):

  • write completes in about 9 sec
  • write completes in about 24 sec (nothing in between)
  • write fails after about 30 sec (some documents were inserted)
  • Erlang crashes after about 30 sec (some documents were inserted)
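
For context, the insert itself is a single _bulk_docs POST. A minimal sketch of the Python side using the couchdb library, assuming file.json holds the usual {"docs": [...]} bulk payload and the host name is a placeholder:

import json
import couchdb  # couchdb-python, the client library mentioned above

server = couchdb.Server("http://yourhost:5984/")  # placeholder host
db = server["test"]

# Load the same 412 documents used in the curl reproduction below.
with open("file.json") as f:
    docs = json.load(f)["docs"]

# Database.update() sends the whole list to /test/_bulk_docs in one request.
results = db.update(docs)
failures = [r for r in results if not r[0]]
print("inserted %d, failed %d" % (len(results) - len(failures), len(failures)))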

vmstat shows near-100% CPU utilization; top shows this is mostly the Erlang process, and truss shows the time is mostly spent in "futex" calls. Disk usage jumps up and down during the operation, but CPU remains pegged.

The logs show lovely messages like:

"could not load validation funs {{badmatch, {error, timeout}}, [{couch_db, '-load_validation_funs/1-fun-1-', 1}]}"

"Error in process <0.13489.10> on node 'bigcouch-test02@bigcouch-test02.oceanobservatories.org' with exit value: {{badmatch,{error,timeout}},[{couch_db,'-load_validation_funs/1-fun-1-',1}]}"

And of course there are Erlang dumps.

From reading about other people's use of BigCouch, this certainly isn't a large update. Our VMs seem beefy enough for the job. I can reproduce with cURL and a JSON file, so it isn't the application. (Can post that too if it helps.)

Can anyone explain why 9 cores and 18GB RAM can't handle a (3x) 490KB write?

More info in case it helps:

Can reproduce with the following commands (save the JSON documents mentioned above as file.json):

url=http://yourhost:5984
curl -X PUT $url/test
curl -X POST $url/test/_bulk_docs -d @file.json -H "Content-Type: application/json"

1 Answer


Got a suggestion that Q=256 may be the issue, and found that BigCouch does slow down a lot as Q grows. This surprised me: I would have thought the hashing and delegation would be pretty lightweight, but perhaps it dedicates too many resources to each DB shard.

As Q grows from too small to allow any real cluster growth to maybe big enough for BigData, the time to do my 490KB update grows from uncomfortably slow to unreasonably slow, and finally into the realm of BigCouch crashes. Here is the time to insert as Q varies, with N=3, R=W=2, and the 3-node cluster described above (a sketch of the measurement loop follows the table):

Q      sec
4      6.4
8      7.7
16    10.8
32    16.9
64    37.0  <-- specific suggestion in adam@cloudant's webcast
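
The numbers above came from re-running the same bulk insert against databases created with different q values. A rough sketch of the measurement loop, assuming the placeholder host, the same file.json payload as in the question, and that the build honors the q query parameter on database creation:

import time
import requests  # plain HTTP client, used here just for timing

BASE = "http://yourhost:5984"  # placeholder host
with open("file.json") as f:   # same 412-document payload as in the question
    payload = f.read()

for q in (4, 8, 16, 32, 64):
    requests.delete(BASE + "/test")                        # drop the previous run
    requests.put(BASE + "/test", params={"q": q, "n": 3})  # recreate with q shards
    start = time.time()
    r = requests.post(BASE + "/test/_bulk_docs", data=payload,
                      headers={"Content-Type": "application/json"})
    print("q=%-3d status=%d  %.1f sec" % (q, r.status_code, time.time() - start))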

This seems like an Achilles' heel for BigCouch: despite the advice to overshard to allow for later cluster growth, you can't have many shards unless you already have a moderate-sized cluster or some powerful hardware!
