
I've tried my best to channel this into a specific and digestible question.

I have a low-traffic website with ~1.5m database reads/writes per day and 300k monthly human users. My synchronous Galera cluster consists of 2 nodes in the UK (DC1), 2 nodes in the US (DC2), and 1 node in Australia (DC3).

All nodes can serve the website; however, all web traffic is routed to one specific node in DC1. If there's an outage, the DNS is updated and web traffic is routed to another node. With a low TTL, downtime is a minute at most. Each node can handle triple the web traffic I have and probably 10x the database activity without affecting performance. The other nodes serve other websites, so it's better to keep traffic on one server and only use another's resources if required.

There are weighted load-balancing options through AWS or Cloudflare that I could use to take advantage of all the servers, but I can't justify the cost and don't believe it's necessary right now. At some point I'd like to route users to their closest node, but right now that's not in my budget (20 million monthly DNS requests, which is predominantly bots/crawlers).

When I went from having 1 server in DC1 to adding a second node in DC1 with Galera, write operations were largely unaffected. However, as the website grew, network outages would take out both nodes and render the website unavailable, and there'd be significant financial loss for several weeks after the website came back online: increased bounce rates, a drop in checkouts, etc. Essentially, people would go to competitors and it would take time for traffic to recover. After seeing this happen consistently after every period of downtime, I understood that I have to keep my site up by any means necessary or lose money.

So I added a node in DC2. As soon as I did this, write operations took a noticeable hit in performance. This was expected because of the physical latency, but traffic continued to grow in line with projections and I didn't have any complaints about the increase in load times. I then added another node in DC2, which didn't impact performance, and eventually one in DC3, which did. Again, the performance impact didn't affect traffic growth or user satisfaction.

With Galera, I thankfully haven't experienced any data loss yet, despite many network outages and several hardware failures a year. Each time, I've switched to another node/DC, and with low-TTL DNS records it's only been a minute of downtime.

Recently my traffic has been growing exponentially, and with celebrity endorsements coming up, I want to get ahead of any changes I need to make before they become an issue.

I've been reading up on Cassandra, its ability to scale on all kinds of hardware and outperform Galera, and I'm wondering if it's worth diving into. Here's my understanding so far; please correct me if I'm wrong.

With Cassandra, by default in a 3-node setup, 2 nodes will have a copy of the data, but if you have 2 nodes in DC1 and 1 in DC2, there's potential for data to be inaccessible if DC1 goes down.

Cassandra can be configured so that all nodes contain all data. I read an article that tested the performance of this, and the author presented results showing that writes still performed better than MySQL/Galera replication. So let's say all 5 nodes across all DCs are configured like this. Is this performance achievable while maintaining the same level of data integrity/consistency as Galera, or better?

If Cassandra replicates asynchronously, wouldn't there be a window of a few milliseconds to a second where data could be lost in an outage? If not, how is it able to maintain better performance without impacting consistency?

igitabout

2 Answers


Welcome to Stack Overflow! A couple of things:

  1. Try to ask only one question per post.

  2. Most questions here should be about programming. This post isn't, but I'll answer it anyway as we can always move it to the Database Administrators Stack Exchange if we need to.

With Cassandra, by default in a 3-node setup, 2 nodes will have a copy of the data, but if you have 2 nodes in DC1 and 1 in DC2, there's potential for data to be inaccessible if DC1 goes down.

Incorrect. In Cassandra, each data center operates independently of the others. If you lose nodes in DC1, DC2 can still serve data just fine.

Cassandra can be configured so that all nodes contain all data. I read an article that tested the performance of this, and the author presented results showing that writes still performed better than MySQL/Galera replication.

I don't recommend this. I've built and supported dozens of mission-critical Cassandra clusters, and most of them ran at a replication factor (RF) of 3 (3 copies of the data per DC), regardless of the number of nodes. One ran at an RF of 5 on 5 nodes, but that was because it was 2011 and we didn't know any better.

It's important to remember that Apache Cassandra was designed around principles from the Amazon Dynamo paper. One of the main principles is that in a large-scale deployment, some piece of hardware is always down. The idea behind going with a small RF (3) is that a single node contains a smaller percentage of the data set, so the impact is lower should that node crash. Plus, it helps with storage costs, too.
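
For reference, here's a minimal sketch with the DataStax Python driver of how per-DC replication is declared. The keyspace name, DC names, and contact point are placeholders, and the DC names must match whatever your snitch reports:

    from cassandra.cluster import Cluster

    # any reachable node works as a contact point; the driver discovers the rest
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # the replication factor is declared per data center, so each DC holds its own
    # replicas and can keep serving reads/writes if another DC is unreachable
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS shop
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'DC1': 3,
            'DC2': 3,
            'DC3': 3
        }
    """)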

So let's say all 5 nodes across all DCs are configured like this. Is this performance achievable while maintaining the same level of data integrity/consistency as Galera, or better?

It depends on a lot of things, but the consistency level used by all operations is pretty key. In Cassandra, the commonly used consistency levels are ONE, QUORUM, LOCAL_ONE, and LOCAL_QUORUM. Essentially, these levels represent how many data replicas need to respond on a read or a write, and whether any replica can respond or only those in the LOCAL_ data center.

The more replicas you are waiting for a response from, the longer an operation is going to take. Waiting for a response from nodes in a non-LOCAL data center is going to take longer, too.

The consistency level usually depends on the business requirements. If an application team was unsure about their approach to consistency, I always coached them to start with LOCAL_ONE and see if they have any problems. Bumping the consistency up to LOCAL_QUORUM is an easy change. Some teams found that writing at LOCAL_QUORUM was good enough for them to still read at LOCAL_ONE.
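
As a rough illustration (hypothetical keyspace, table, and contact point, using the DataStax Python driver), writing at LOCAL_QUORUM while reading at LOCAL_ONE looks like this:

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("shop")

    order_id = uuid.uuid4()

    # the write waits for a majority of replicas in the local DC only
    insert = SimpleStatement(
        "INSERT INTO orders (id, user_id, total) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(insert, (order_id, 42, 19.99))

    # the read waits for a single replica in the local DC
    select = SimpleStatement(
        "SELECT total FROM orders WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_ONE)
    row = session.execute(select, (order_id,)).one()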

QUORUM is one of the reasons that RF=3 is such a common replication factor. A quorum is a majority of the replicas: floor(RF / 2) + 1. RF=3 at QUORUM requires a response from 2 replicas, RF=4 requires 3, and RF=5 requires 3.
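
In other words:

    # quorum = a majority of replicas = floor(RF / 2) + 1
    for rf in (3, 4, 5):
        print(f"RF={rf}: quorum needs {rf // 2 + 1} replica responses")
    # RF=3: 2, RF=4: 3, RF=5: 3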

In your case, I'd guess that you're likely to see an increase in performance. This is because local web traffic will be able to use their LOCAL_ data center, instead of being routed to DC1.

If Cassandra replicates asynchronously, wouldn't there be a window of a few milliseconds to a second where data could be lost in an outage?

Yes, there is such a window. It's usually in the double-digit milliseconds for cloud regions on the same continent; replicating "across the pond" and down to Australia might take a bit longer.

And data wouldn't be "lost" in the event of an outage during replication from one DC to another. It'd still be in the DC which handled the original write. Periodic running of the repair process will make sure that any inconsistent replicas are fixed. If that becomes a problem, then reading at LOCAL_QUORUM should help, as one inconsistent replica shouldn't hinder consensus on a read operation.

Aaron
  • How does Cassandra deal with the "split brain" syndrome? (DC1 and DC2 are both still alive, but the _network_ makes each think the other is down.) – Rick James Sep 12 '22 at 20:03

MySQL/MariaDB with Galera

My experience with changing a DNS entry has been underwhelming. I measured one update as taking 22 minutes, in spite of a 5-minute TTL. But that was many years ago.

An alternative to DNS is to have some lightweight proxy servers on each server; they will do health checks to see what is alive and redirect the request to the desired node. (No, I don't have a recommendation; there are several.)
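
Just to give a feel for the idea (the node addresses and port are made up, and a real proxy does much more: retries, real health probes, connection pooling), the core of such a check is only this:

    import socket

    NODES = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]   # hypothetical node IPs

    def first_healthy(port=3306, timeout=1.0):
        """Return the first node that accepts a TCP connection, or None."""
        for host in NODES:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return host
            except OSError:
                continue
        return None

    print(first_healthy())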

It might be best to have the web servers scattered around the world, even if all db activity is directed to a single Primary db server. A customer in New Zealand may hit the web server in Australia more times than that web server hits the UK db. (Check your code; I may have it backwards.)

Galera between the UK and AUS: that's a long way. Each COMMIT requires one round trip between the node being written to and each of the other nodes. Note that the long hop occurs whether you are writing to a nearby node or a faraway one. It is probably faster for you to write to the nearest node; then there is only the one big hop at COMMIT.

One thing to do in the code is to combine multiple statements into a single Stored Procedure. Then the CALL across the pond happens only once for that set of instructions. This can be a big boost in performance, but may not be practical, depending on how the data is processed. (In my first experience with this technique, I shrank the number of queries from hundreds to 7. That was "good enough", so I stopped there.)
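
As a hedged sketch (the procedure, table, and connection details are invented for illustration), the pattern from Python with mysql-connector looks roughly like this; the CREATE PROCEDURE only needs to run once:

    import mysql.connector

    conn = mysql.connector.connect(host="galera-dc1.example.com", user="app",
                                   password="secret", database="shop")
    cur = conn.cursor()

    # run once; the whole CREATE PROCEDURE is sent as a single statement,
    # so no DELIMITER tricks are needed outside the mysql CLI
    cur.execute("""
        CREATE PROCEDURE place_order(IN p_user INT, IN p_sku INT, IN p_qty INT)
        BEGIN
            INSERT INTO orders (user_id, sku, qty) VALUES (p_user, p_sku, p_qty);
            UPDATE stock SET on_hand = on_hand - p_qty WHERE sku = p_sku;
        END
    """)

    # one CALL across the pond (and one Galera certification at COMMIT)
    # instead of a round trip per statement
    cur.callproc("place_order", (42, 1001, 2))
    conn.commit()
    cur.close()
    conn.close()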

If your traffic continues to grow, I would not add lots of writable Galera nodes (past, say, a total of 5); instead, I would add Replicas hanging off any or all of the Galera nodes. These replicas would be read-only. Your code (or a sophisticated proxy) would need to direct writes to the set of writable nodes and reads to a different set of read-only nodes, whether that routing is DNS- or proxy-based. This can provide "infinite" scaling of reads. But be aware of the "critical read" issue; that is, don't act on something read from a read-only server without verifying that it is really that way on [any] writable node.
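
A bare-bones sketch of that routing (hostnames, table, and credentials are placeholders): plain reads go to a read-only replica, while writes and "critical" reads go to a writable Galera node:

    import mysql.connector

    write_conn = mysql.connector.connect(host="galera-dc1.example.com", user="app",
                                         password="secret", database="shop")
    read_conn = mysql.connector.connect(host="replica-dc1.example.com", user="app",
                                        password="secret", database="shop")

    def query(sql, params=(), critical=False):
        """Plain reads hit the replica; critical reads hit a writable node."""
        conn = write_conn if critical else read_conn
        cur = conn.cursor()
        cur.execute(sql, params)
        rows = cur.fetchall()
        cur.close()
        return rows

    def execute(sql, params=()):
        """All writes go to the writable Galera node."""
        cur = write_conn.cursor()
        cur.execute(sql, params)
        write_conn.commit()
        cur.close()

    # a checkout that must see its own write reads back with critical=True
    execute("INSERT INTO orders (user_id, sku, qty) VALUES (%s, %s, %s)", (42, 1001, 2))
    latest = query("SELECT id FROM orders WHERE user_id = %s ORDER BY id DESC LIMIT 1",
                   (42,), critical=True)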

NDB

Caveat: The change to NDB would not be trivial.

NDB "transactions" uses an "eventual consistency" model. For certain applications, it is quite OK to COMMIT 'immediately' and worry about inconsistencies later. One large global app claims "1 second".

NDB allows any number of writable nodes scattered around the world. And it has a lot of redundancy, so data loss is nil.

Rick James