
I am currently looking at CouchDB and I understand that I have to specify all the replications by hand. If I want to use it on 100 nodes how would I do the replication?

  • Doing 99 "replicate to" and 99 "replicate from" on each node
    • This feels like overkill, since replicating with one node already carries the changes that node received from every other node
  • Doing 1 replicate to the next one to form a circle (like A -> B -> C -> A)
    • Would work until one node crashes; then everything downstream waits until it comes back
    • Latency would be high when replicating a change from the first node to the last

Isn't there a way to say: "here are 3 IPs on the full network. Connect to them and share with everyone as you see fit like an independent P2P" ?

Thanks for your insight

Simon Levesque
  • Maybe [BigCouch](https://github.com/cloudant/bigcouch) is what you should use instead? It basically takes big clusters of nodes and allows them to appear as a single instance of CouchDB to end-users/applications. – Dominic Barnes Nov 30 '12 at 14:59
  • I agree with Dominic. Have a look at Cloudant and save yourself the trouble. What you are probably after is sharding which is what BigCouch (and Cloudant) does for you. – AndyD Dec 11 '12 at 16:50

1 Answer


BigCouch won't provide the cross data-center stuff out of the box. Cloudant DBaaS (based on BigCouch) does have this setup already across several data-centers.

BigCouch is a sharded, "Dynamo-style" fork of Apache CouchDB--it is slated to be merged into "mainline" Apache CouchDB in the future, fwiw. The shards live across nodes (servers) in the same data-center. "Classic" CouchDB-style replication is used (afaik) to keep the BigCouch clusters in the various data-centers in sync.

CouchDB-style replication (n-master) is change-based, so replication only includes the latest changes.

You would need to set up to/from pairs of replication for each node/database combination. However, if all of your servers are intended to be identical, replication won't actually transfer much--changes are only sent when needed.
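To give a sense of the scale, here's a minimal sketch of enumerating those to/from pairs for a full mesh. The node names are hypothetical; in a real setup each pair would become a POST to `/_replicate` or a document in the `_replicator` database.

```python
# Sketch: enumerate the ordered (source, target) replication pairs
# needed for a full n-master mesh. Node names are made up for illustration.
from itertools import permutations

def replication_pairs(nodes):
    """Return every ordered (source, target) pair of distinct nodes --
    one 'replicate to' and one 'replicate from' per pair."""
    return list(permutations(nodes, 2))

pairs = replication_pairs(["node-a", "node-b", "node-c"])
# 3 nodes -> 6 ordered pairs; 100 nodes would need 100 * 99 = 9900
```

The quadratic growth is exactly why automating the setup matters once you go past a handful of nodes.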

If A gets a change, replication ships it to B and C (etc). However, if B--having just received that change--replicates it to C before A gets the chance to--due to network latency, etc--then when A does finally try, it will see the data is already there and not bother sending the change again.

If this is a standard part of your setup (i.e., every time you make a db you want it replicated everywhere else), then I'd highly recommend automating the setup.

Also, check out the _replicator database. It makes it much easier to manage what's going on: https://gist.github.com/fdmanana/832610
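For illustration, a `_replicator` document looks roughly like this (the hostnames and database name here are made up): you PUT it into the `_replicator` database on the node that should run the replication, and CouchDB keeps it running and records its status on the document.

```json
{
  "_id": "rep_node_a_to_node_b",
  "source": "http://node-a.example.com:5984/mydb",
  "target": "http://node-b.example.com:5984/mydb",
  "continuous": true
}
```

Deleting the document cancels the replication, which is much easier to manage than anonymous POSTs to `/_replicate`.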

Hope something in there is useful. :)

BigBlueHat