I've tried my best to distill this into a specific and digestible question.
I have a low-traffic website with ~1.5m database reads/writes per day and 300k monthly human users. My synchronous Galera cluster consists of 2 nodes in the UK (DC1), 2 nodes in the US (DC2) and 1 node in Australia (DC3).
All nodes can serve the website; however, all web traffic is routed to one specific node in DC1. If there's an outage, the DNS is updated and web traffic is routed to another node. With a low TTL, downtime is a minute at most. Each node can handle triple my current web traffic and probably 10x the database activity without affecting performance. The other nodes serve other websites, so it's better to keep traffic on one server and only draw on another's resources when required.
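For context, the failover is just a manual swap of a low-TTL A record, roughly like this (hostnames and IPs are made up for illustration):

```
; BIND-style zone sketch: 60-second TTL so a record swap propagates quickly
www  60  IN  A  203.0.113.10    ; primary node in DC1
; on outage, repoint:
; www  60  IN  A  198.51.100.20 ; standby node in DC2
```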
There are weighted load-balancing options through AWS or Cloudflare that I could use to take advantage of all the servers, but I can't justify the cost and don't believe it's necessary right now. At some point I'd like to route traffic to each user's closest node, but right now that's not in my budget (~20 million monthly DNS requests, predominantly bots/crawlers).
When I went from a single server in DC1 to adding a second DC1 node with Galera, write operations were largely unaffected. However, as the website grew, network outages that took out both nodes rendered the website unavailable, and for several weeks after the site came back online there'd be a significant financial loss: increased bounce rates, a drop in checkouts, etc. Essentially people would go to competitors and it would take time for the traffic to recover. After this happened consistently after every period of downtime, I understood I had to keep my site up by any means necessary or lose money.
So I added another node in DC2. As soon as I did this, write operations took a noticeable performance hit. This was expected because of the physical latency, but traffic continued to grow in line with projections and I didn't get any complaints about the increase in load times. I then added a second node in DC2, which didn't impact performance, and eventually one in DC3, which did. Again, the performance impact didn't affect traffic growth or user satisfaction.
Thankfully, I haven't experienced any data loss with Galera yet, despite many network outages and several hardware failures a year. Each time, I've switched to another node/DC, and with low-TTL DNS records it's only been a minute of downtime.
Recently my traffic has been growing exponentially, and with celebrity endorsements coming up I want to get ahead of any changes I need to make before the extra load becomes an issue.
I've been reading up on Cassandra, its ability to scale across all kinds of hardware, and claims that it outperforms Galera, and I'm wondering if it's worth diving in. Here's my understanding so far; please correct me if I'm wrong.
With Cassandra, in a typical 3-node setup with a replication factor of 2, two nodes hold a copy of each piece of data. But if you have 2 nodes in DC1 and 1 in DC2, both replicas could land in DC1, so that data could be inaccessible if DC1 goes down.
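To make that concrete, here's the sort of setup I mean (the keyspace name is made up; as I understand it, SimpleStrategy places replicas around the ring without regard to data centres):

```
-- RF=2 over 3 nodes: each row lives on 2 nodes, chosen without
-- DC awareness, so both copies of a row can end up in DC1.
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
```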
Cassandra can also be configured so that all nodes contain all data. I read an article that benchmarked this, and the author's results showed that writes still performed better than MySQL/Galera replication. So let's say all 5 nodes across all DCs are configured like this, as sketched below. Is that performance achievable while maintaining the same level of data integrity/consistency as Galera, or better?
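Concretely, I assume that would look something like this, with per-DC replication factors equal to the node counts so every node holds a full copy (again, the keyspace and DC names are mine for illustration):

```
-- NetworkTopologyStrategy with RF per DC = nodes per DC,
-- i.e. a full copy of the keyspace on all 5 nodes.
ALTER KEYSPACE shop
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'DC1': 2, 'DC2': 2, 'DC3': 1};
```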
If Cassandra replicates asynchronously, wouldn't there be a window of a few milliseconds to a second where data could be lost in an outage? If not, how is it able to maintain better performance without impacting consistency?
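From what I've read, the knob here is the per-query consistency level, which is where I get confused about the trade-off. Something like this cqlsh sketch, against a hypothetical orders table:

```
-- Hypothetical table for illustration
CREATE TABLE shop.orders (id uuid PRIMARY KEY, total decimal);

-- Ack from a majority of replicas in the local DC only; fast, but
-- remote DCs catch up asynchronously (the data-loss window I mean)
CONSISTENCY LOCAL_QUORUM
INSERT INTO shop.orders (id, total) VALUES (uuid(), 42.00);

-- Ack from a majority of replicas in every DC: closer to Galera-style
-- durability, but the WAN round-trips come back
CONSISTENCY EACH_QUORUM
INSERT INTO shop.orders (id, total) VALUES (uuid(), 42.00);
```

My reading is that the Galera-beating benchmark numbers assume the weaker levels, which is exactly the data-loss window I'm asking about. Is that right?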