
I am getting unacceptably low performance with the Galera setup I created. My setup has 2 nodes in active-active mode, and I am doing reads and writes on both nodes in a round-robin fashion through an HAProxy load balancer.

I was easily able to get over 10000 TPS on my application with a single MariaDB server with the following configuration: 36 vCPU, 60 GB RAM, SSD, 10Gig dedicated pipe

With Galera I am barely getting 3500 TPS, although I am using 2 DB nodes (36 vCPU, 60 GB RAM) load balanced by HAProxy. For information, HAProxy is hosted as a standalone node on a different server. I have removed HAProxy for now, but there is no improvement in performance.

Can someone please suggest which my.cnf parameters I should consider tuning for this severely under-performing setup?

I am using the following my.cnf file:

[Two screenshots of the my.cnf file; images not available]

LakshayK
  • First of all, this question should be on dba.stackexchange.com. Also, it would be easier to tell us your configuration than to list every possible general performance tip. Generally, a Galera cluster will have slower write/transaction performance than a single instance (as the nodes have to communicate). 35% is lower than expected, but it depends on what you are actually doing (table design/queries) to tell if it is something in the configuration or not. Also be aware that a two-node cluster actually increases the possibility that the cluster fails (because if any of those 2 fail, both fail). – Solarflare Jan 12 '17 at 11:36
  • Thanks Solarflare, I will post this on dba.stackexchange.com. I am also adding my my.cnf parameters to the question. – LakshayK Jan 12 '17 at 12:16

2 Answers


> I was easily able to get over 10000 TPS on my application with a single MariaDB server with the following configuration: 36 vCPU, 60 GB RAM, SSD, 10Gig dedicated pipe
>
> With Galera I am barely getting 3500 TPS, although I am using 2 DB nodes (36 vCPU, 60 GB RAM) load balanced by HAProxy.

Clusters based on Galera are not designed to scale writes the way you intend to. In fact, as Rick mentioned, sending writes to multiple nodes for the same tables will end up causing certification conflicts that show up as deadlocks in your application, adding huge overhead.
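If some writes still end up on more than one node, the application has to be prepared to retry transactions that fail certification. Below is a minimal sketch of such a retry wrapper, assuming the conflict reaches the client as the usual MySQL deadlock error (1213). `DeadlockError` and `run_with_retry` are hypothetical names used only for illustration; a real driver raises its own exception type, which you would catch instead.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock exception (MySQL error 1213)."""


def run_with_retry(txn, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call txn(); on DeadlockError, retry with jittered exponential backoff.

    txn is assumed to run one complete transaction (BEGIN ... COMMIT) and to
    be safe to re-run from the start. The last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Back off 1x, 2x, 4x ... base_delay, plus up to 100% random jitter
            # so that conflicting writers do not retry in lockstep.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

Retrying only hides the symptom; each retry repeats the full transaction, so routing writes to one node remains the cheaper fix.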

> I am getting unacceptably low performance with the Galera setup I created. My setup has 2 nodes in active-active mode, and I am doing reads and writes on both nodes in a round-robin fashion through an HAProxy load balancer.

Please send all writes to a single node and see if that improves performance. There will always be some overhead due to the nature of the virtually synchronous replication that Galera uses, which literally adds network overhead to each write you perform (true clock-based parallel replication will offset this impact quite a bit, but you are still bound to see slightly lower throughput).

Also make sure to keep your transactions short and COMMIT as soon as you are done with an atomic unit of work, since the replication certification process is single-threaded and will stall writes on the other nodes. If you see transactions on your writer node in the wsrep pre-commit stage, that means the other nodes are doing certification for a large transaction, or that the node is suffering performance problems of some sort (swap, full disk, abusively large reads, etc.).

Hope that helps, and let us know how it goes when you move to single node.

MarcosAlbe
  • Actually, on the [product page of the galera cluster website](http://galeracluster.com/products/), they make it very clear that it is now intended to scale writes – the beest Jun 03 '17 at 09:27

Turn off the QC:

query_cache_size = 0    # not 22 bytes
query_cache_type = OFF  # QC is incompatible with Galera

Increase innodb_io_capacity

How far apart (ping time) are the two nodes?

Suggest you pretend that it is Master-Slave. That is, have HAProxy send all traffic to one node, leaving the other as a hot backup. Certain things can run faster in this mode; I don't know about your app.

Rick James
  • Hi Rick, the nodes are within the same VPC and the same subnets, and I guess the ping time is less than 50 ms. My only purpose in using Galera is to see if I can utilise the CPUs of both machines while keeping common data, so that my application can read and write virtually on a single set of data. I have a feeling that Galera may not be the best option for this. What do you think? Any suggestions? How about NDB? – LakshayK Jan 14 '17 at 19:31
  • 50ms is about 1500 miles / 2500km. 50ms would noticeably cut back on performance. Were you actually consuming more than half the cores on one machine? And keep in mind that some CPU processing is needed to handle replication on the receiving end. – Rick James Jan 14 '17 at 21:43
  • NDB has a lot of other issues, including non-trivial changes in what can/cannot be done in SQL. – Rick James Jan 14 '17 at 21:44
  • What do the queries look like? Do they mostly hit one table, or many? Mostly reads? Or writes? Point queries vs table scans? What are the 'worst' queries according to the slowlog? How many connections? – Rick James Jan 14 '17 at 21:46
  • A single connection that is nearly maxing out the server _will_ run slower with Galera. Hundreds of connections touching dozens of tables _should_ run faster with Galera. Etc. – Rick James Jan 14 '17 at 21:48
  • Note that the comments in the recommended config changes are no longer accurate. Galera now supports query cache, as of MariaDB versions "5.5.40-galera", "10.0.14-galera" and "10.1.2". https://mariadb.com/kb/en/library/query-cache/#limitations – Dave Sherohman Jun 18 '18 at 14:00
  • @DaveSherohman - Hmmm.. That reference page is a bit vague. Even if the QC is supported, I wonder whether it has extra overhead in a Galera environment. For 90+% of production systems it is actually better to turn off the QC. – Rick James Jun 18 '18 at 19:01
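
The ping-time and connection-count points in the comments above can be put into rough numbers. The sketch below assumes a deliberately simplified model that is not from the thread: every commit on a connection waits for at least one network round trip to the other node, and nothing else limits throughput. The function names are made up for illustration, and the results are upper bounds under that assumption, not measurements.

```python
import math


def max_commit_rate(rtt_ms: float, connections: int = 1) -> float:
    """Upper bound on commits/second if each commit waits one round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    per_connection = 1000.0 / rtt_ms  # commits/second for one connection
    return per_connection * connections


def connections_needed(target_tps: float, rtt_ms: float) -> int:
    """Minimum concurrent connections to reach target_tps in the same model."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return math.ceil(target_tps * rtt_ms / 1000.0)
```

Under this model a 50 ms round trip caps one connection at 20 commits per second, so 10000 TPS would need at least 500 concurrent connections, while at 0.5 ms a single connection could reach 2000. Measuring the real ping time between the two nodes is therefore the first thing to check.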