
I am having huge problems with my cluster. Servers keep disconnecting and crashing for no apparent reason (there is nothing in the logs). I think I might have the cluster set up wrong.

First things first: I understand sharding, and it is a great feature, but what is:

"n replica per shard"?

What does that mean?

Second thing: how do I configure a cluster with "n" servers? I have 6 servers because of sharding (I have a few docs with more than 10 million records), but I am not sure that I configured my cluster correctly.

On every server I wrote:

for example (srv1.conf)
join=srv2:port
join=srv3:port
join=srv4:port
join=srv5:port
join=srv6:port

Is this the correct way to add a server to a cluster?

There is nothing in the docs, and it would be great if you could post some "recommended" cluster configuration.

And the third thing is about failover. In my 6-server cluster all tables have 6 shards with three replicas. Once I shut down, for example, server 1, the app goes down and some crazy writes begin on the cluster. What is the point of a cluster if I do not have redundancy when a server goes down?

I really hope that someone can help me with this, because when I had just one server my app was working all the time. Now every time some servers get disconnected, everything crashes. I am using Node.js with rethinkdbdash.

UPDATE

I know what a shard is. I have 2 million records in one table, for example, and they are distributed among 6 servers (for me this is important because of read speed). I do not understand what a "replica" is. Every table is configured like this: 6 shards and 3 replicas per shard. From what you said, that means that if some server goes down the table will still be available for reads, but it is not (the error says something like set read_mode=outdated, and the app crashes). There is no way that I am going to change every part of the app that does a read to say read_mode=outdated. That is just poor programming.

There is nothing in the logs. On every server I have this in dmesg:

TCP: TCP: Possible SYN flooding on port 28015. Sending cookies.  Check SNMP counters.
pregmatch

1 Answer


Servers keep disconnecting and crashing for no apparent reason (there is nothing in the logs).

It's going to be difficult to help you with the crashing if there's nothing in the logs. If you're using an init manager like systemd, what does it say? Did RethinkDB quit, or did it stop responding? How much memory is available for RethinkDB? Are there any messages relating to RethinkDB in dmesg or your syslog? Do the logs at least tell you the servers are disconnected? Is the web interface reporting any issues?

"n replica per shard"?

What does that mean?

Let's say we have a pizza which represents the data in a database. Sharding is cutting the pizza into smaller slices, so let's say we cut it into 4 slices (shards). To have n replicas, we simply make n copies of each slice. Let's make n = 3, so we have 4 shards and 3 replicas of each shard, adding up to a total of 12 pieces. What we can now do is distribute these pieces across multiple servers.
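To make the counting concrete, here is a minimal sketch of the pizza example in JavaScript. The helper `placeReplicas` and the server names are hypothetical and only illustrate the idea; RethinkDB chooses the real placement itself.

```javascript
// Illustration only: placeReplicas is a hypothetical helper, not a RethinkDB API.
// It spreads shards x replicas "pieces" over a list of servers.
function placeReplicas(shards, replicasPerShard, servers) {
  if (replicasPerShard > servers.length) {
    throw new Error("cannot have more replicas per shard than servers");
  }
  const placement = [];
  for (let shard = 0; shard < shards; shard++) {
    for (let replica = 0; replica < replicasPerShard; replica++) {
      // Offset by the shard index so the copies of one shard
      // always land on different servers.
      const server = servers[(shard + replica) % servers.length];
      placement.push({ shard, replica, server });
    }
  }
  return placement;
}

// 4 slices (shards), 3 copies (replicas) of each, on 4 assumed servers.
const pieces = placeReplicas(4, 3, ["srv1", "srv2", "srv3", "srv4"]);
console.log(pieces.length); // 12 pieces: 4 shards x 3 replicas
```

The point of the offset is that no server ever holds two copies of the same slice, which is what makes the copies useful when a server is lost.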

In your case you seem to want a system with high availability. That requires a minimum of 3 replicas (thus 3 servers), and an odd number is preferred, because a majority of the replicas of each shard must be available for the database to continue operating. Let's say I have 2 shards, each with 3 replicas, distributed across 6 servers, so that each server holds a single replica of one shard. If 1 server goes down, that's okay: there are still 2 other replicas (servers storing the same data as the server that went down), and because 2 of 3 replicas are available (a majority), the database can continue to operate.

On every server I wrote: ... Is this the correct way to add a server to a cluster?

You must specify the server's canonical-address, which is the address (not including the port) that other servers will use to connect to it. You should only provide 1 join argument, as the database will automatically ask the server it's joining for the addresses of all the servers connected to the cluster. All servers in the cluster must be able to communicate with each other using the canonical-address.
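The majority rule can be sketched in a few lines of JavaScript. This is a simplified model of the rule described above, not RethinkDB's actual implementation; the helper names and the `srv1` to `srv6` layout are assumptions made for the example.

```javascript
// Hypothetical helper: a shard counts as available while a strict
// majority of its replicas is on servers that are still up.
function shardAvailable(replicaServers, downServers) {
  const up = replicaServers.filter((s) => !downServers.includes(s)).length;
  return up > replicaServers.length / 2;
}

// A table is available only if every one of its shards is available.
function tableAvailable(table, downServers) {
  return Object.values(table).every((replicas) =>
    shardAvailable(replicas, downServers)
  );
}

// 2 shards with 3 replicas each, one replica per server (6 servers).
const table = {
  shardA: ["srv1", "srv2", "srv3"],
  shardB: ["srv4", "srv5", "srv6"],
};

console.log(tableAvailable(table, ["srv1"]));         // true: 2 of 3 replicas of shardA remain
console.log(tableAvailable(table, ["srv1", "srv2"])); // false: only 1 of 3 replicas of shardA remains
```

This also shows why an even replica count buys nothing extra: with 4 replicas a shard still tolerates only 1 lost server, because 2 of 4 is not a majority.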
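As a rough sketch of that advice, the snippet below renders one config file per server with a single join line that points at one seed server. The helper `renderConfig`, the `example.com` hostnames, and the choice of the first server as the seed are assumptions made for illustration; the option names and default ports are the ones used elsewhere in this answer.

```javascript
// Hypothetical helper: builds the text of one server's config file.
function renderConfig(hostname, seedHost, clusterPort = 29015) {
  const lines = [
    "bind=all",
    `canonical-address=${hostname}`,
    "driver-port=28015",
    `cluster-port=${clusterPort}`,
  ];
  // One join line is enough; the seed itself gets none in this sketch.
  if (hostname !== seedHost) {
    lines.push(`join=${seedHost}:${clusterPort}`);
  }
  return lines.join("\n");
}

// Assumed hostnames for a 6-server cluster.
const hosts = [1, 2, 3, 4, 5, 6].map((n) => `srv${n}.example.com`);
const configs = hosts.map((h) => renderConfig(h, hosts[0]));

console.log(configs[1]);
```

Compared with listing five join lines on every server, each file here differs only in its canonical-address, which makes a wrong or missing address easier to spot.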

There is nothing in the docs, and it would be great if you could post some "recommended" cluster configuration.

Here's what my configuration file looks like for a cluster:

bind=all
canonical-address=server.domain.com
driver-port=28015
cluster-port=29015
join=otherserver.domain.com:29015

cluster-tls-key=/path/to/key.pem
cluster-tls-cert=/path/to/cert.pem
cluster-tls-ca=/path/to/cert.pem

I've set up TLS for intracluster communication, as my servers need to communicate over the Internet and I want that traffic to be encrypted. Refer to https://www.rethinkdb.com/docs/security/ for information on securing your cluster. You can also encrypt driver connections.

What is the point of a cluster if I do not have redundancy when a server goes down?

You can set up replicas for your database. I've explained some of the concepts behind this above.

Information regarding replication can be found here: https://www.rethinkdb.com/docs/sharding-and-replication/

1lann
  • I've explained what replicas are in my post; in summary, they are just copies of a shard. What you're describing sounds like an issue with the driver connection, not the cluster communication (i.e. it's not the database's fault, it's your application's or driver's fault). What database driver are you using? Are you spamming the server with queries that could otherwise be grouped into a single query? Are you trying to run a lot of queries (more than 1000) at once? If so, your driver may be creating more connections for each one, causing connection issues. – 1lann Sep 23 '16 at 16:50
  • Servers are going offline and then tables are reconfiguring. I have 6 servers, and for each table I have 6 shards with 3 replicas per shard. – pregmatch Sep 27 '16 at 19:00
  • I do not have a lot of queries at all. I am using rethinkdbdash. – pregmatch Sep 27 '16 at 19:01
  • In my app log I am getting a bunch of (edited) Cannot perform read: primary replica for shard ["Sd51611d1\x2D072d\x2D4f1c\x2D8433\x2D5b93cbb834b4", +inf) not available in: – pregmatch Sep 27 '16 at 19:01
  • That is strange behaviour, what is the exact issue that RethinkDB reports in the web interface? It should say something like the following tables are unavailable due to the following servers being disconnected, can you verify there is only 1 server listed? Also can you show the replica distribution? It's in the web interface for the table listed as "Servers used by this table". Can you take a screenshot of this before and after a server goes down? I'm also on the RethinkDB Slack channel as 1lann, I realise you're nikola. – 1lann Sep 28 '16 at 05:36
  • The first image is from when some servers go down: https://i.imgsafe.org/cfb2a5bd10.png, the second one is the app error that I am getting: https://i.imgsafe.org/cfb28bcd34.png and the third one is "servers used by this table": https://i.imgsafe.org/cfb27c912f.png – pregmatch Sep 29 '16 at 11:31
  • There appear to be mixed errors: the issues say that `db3` is disconnected, yet in "servers used by this table", `db4` is down. Either way, your table should not become unavailable. If what you say is true and no other messages or errors are logged, then you should take this to GitHub issues, as it seems to be a fault in RethinkDB: https://github.com/rethinkdb/rethinkdb/issues. That will also bring it to the attention of the RethinkDB developers, who check GitHub issues more often. I can't see anything wrong in your configuration, so I can't tell what's wrong, sorry. – 1lann Sep 29 '16 at 17:14
  • Some information you should post in your issue to help others figure out your problem faster are your raw logs, a list of all the issues from `r.db("rethinkdb").table("current_issues")`, a list of your table configuration from `r.db("rethinkdb").table("table_config")`, and the status of all your tables from `r.db("rethinkdb").table("table_status")` when this issue occurs. – 1lann Sep 29 '16 at 17:17
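A small sketch of how those three system tables could be collected in one go from Node.js. It assumes `r` is an already configured rethinkdbdash instance (for example `require("rethinkdbdash")({ servers: [...] })`); the function name `collectDiagnostics` is hypothetical.

```javascript
// Assumption: `r` is a rethinkdbdash instance created elsewhere.
// collectDiagnostics is a hypothetical helper, not part of the driver.
async function collectDiagnostics(r) {
  const sys = r.db("rethinkdb");
  // Run the three system-table queries named in the comment above in parallel.
  const [issues, config, status] = await Promise.all([
    sys.table("current_issues").run(),
    sys.table("table_config").run(),
    sys.table("table_status").run(),
  ]);
  return { issues, config, status };
}
```

Calling `collectDiagnostics(r).then((d) => console.log(JSON.stringify(d, null, 2)))` while a server is down would give output that can be pasted into a GitHub issue next to the raw logs.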