-1

I have a virtual machine with 64 vCPUs and 256GB of memory. Recently, I decided to perform some stress tests on the website that is running on this virtual machine. The entire VM is only for this website.

The first test I ran was with 20,000 users per second, and the average response time was around 1400ms. During the test, the site was not usable.

enter image description here

After that, I decided to inspect the top processes to identify the source of the problem. These are the processes and their CPU utilization during the test:

top - 10:30:19 up 1 day, 34 min,  0 users,  load average: 8.39, 3.04, 1.46
Tasks: 711 total,   2 running, 709 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.0 us,  9.8 sy,  3.8 ni, 79.2 id,  0.2 wa,  0.0 hi,  0.9 si,  0.0 st
MiB Mem : 257925.6 total, 219425.1 free,   3658.2 used,  34842.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used. 252346.8 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                        
218159 mysql     20   0 6911232  96204  19792 S 491.4   0.0   4:24.99 mysqld                                         
139405 nobody    20   0   54948  34196   6128 D  44.9   0.0   0:52.17 litespeed                                      
218251 obl74+  21   1  347708  29228  19328 S  40.9   0.0   0:20.83 lsphp                                          
218402 obl74+  21   1  347708  29152  19264 S  40.9   0.0   0:22.35 lsphp                                          
218955 obl74+  21   1  273004  21336  12472 D  40.9   0.0   0:22.39 lsphp                                          
218957 obl74+  21   1  273004  21336  12472 D  40.9   0.0   0:22.22 lsphp                                          
218961 obl74+  21   1  273004  21336  12472 S  40.9   0.0   0:22.37 lsphp                                          
218963 obl74+  21   1  273004  21328  12468 S  40.9   0.0   0:22.31 lsphp                                          
218252 obl74+  21   1  347708  29228  19328 D  40.5   0.0   0:22.42 lsphp                                          
218407 obl74+  21   1  347708  29152  19264 D  40.5   0.0   0:22.30 lsphp                                          
218956 obl74+  21   1  273004  21332  12472 S  40.5   0.0   0:20.73 lsphp                                          
218959 obl74+  21   1  273004  21336  12472 S  40.5   0.0   0:22.13 lsphp 

Interestingly, despite the website's poor performance during the test, neither the CPU nor memory usage was particularly high. Also, during the test, CyberPanel indicated a CPU usage of 19% and a memory usage of 2%. Therefore, I conclude that the server is not experiencing any resource constraints, as it is not utilizing all its CPU and memory. However, it is still lagging for some reason.

Then, I decided to remove the components related to MySQL from the page on which I performed the stress test. The outcome was much more stable.

enter image description here

top - 10:43:54 up 1 day, 47 min,  0 users,  load average: 0.87, 1.23, 1.41
Tasks: 705 total,   5 running, 699 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.8 us,  1.0 sy,  0.4 ni, 95.2 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem : 257925.6 total, 218249.7 free,   3910.0 used,  35765.9 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used. 252098.9 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                        
139416 nobody    20   0   53200  32480   6128 S  18.3   0.0   0:47.00 litespeed                                      
139402 nobody    20   0   52928  33308   7204 S  16.6   0.0   0:44.40 litespeed                                      
139409 nobody    20   0   54900  34136   6188 S  16.6   0.0   0:46.38 litespeed                                      
139410 nobody    20   0   49904  29156   6128 S  16.6   0.0   0:35.43 litespeed                                      
139414 nobody    20   0   51688  30936   6128 R  16.6   0.0   0:45.46 litespeed                                      
139415 nobody    20   0   55492  35280   6680 R  15.9   0.0   0:46.24 litespeed                                      
139412 nobody    20   0   52112  31420   6188 S  15.6   0.0   0:45.05 litespeed                                      
139404 nobody    20   0   50396  29644   6128 S  15.3   0.0   0:44.83 litespeed                                      
139413 nobody    20   0   44700  23816   6128 S  15.3   0.0   0:21.83 litespeed                                      
139406 nobody    20   0   50752  30004   6128 S  15.0   0.0   1:05.25 litespeed 

According to CyberPanel, during the new test, the CPU usage was 4% and the memory usage was 2%.

Therefore, it is obvious that there is an issue with MySQL. I am currently using the default my.cnf configuration provided by CyberPanel, but I have attempted various other configurations found on the internet, yet nothing has improved the performance even a little bit. I've also tried stuff like MySQL Tuner but it didn't change the performance.

The MySQL part that I removed for the second test was a basic query of a table that contained 7 rows. It verified the user's IP address to determine if they were on the IP whitelist. This operation should not have posed a significant problem.

As observed in both tests, I detect a threshold or a bottleneck at the start, beyond which the site experiences a sharp increase in lag. Despite having ample free memory and CPU, there seems to be some limiting factor.

Some might argue that a rate of 20,000 users per second is excessive and unrealistic. However, even when I conducted the test with only 250 users per second, the outcome was the same: the website was extremely slow and not usable.

enter image description here

At this point I am totally lost. I am uncertain as to where to focus my efforts and what steps to take next to decrease the average response time. I would greatly appreciate any insightful comments or suggestions you may have and I thank you in advance for your time and consideration.

UPDATE

I have reinstalled the operating system and CyberPanel, and it appears that the problem has been resolved. Although I am uncertain about what went wrong previously, I suspect that an incorrect setting was responsible.

  • This is likely to get closed, as you're not asking a specific question. I would suggest: - ramping up the number of concurrent users much more slowly. That should give you an idea of the point where you're hitting a bottleneck. - running an EXPLAIN on the MySQL query - "This operation should not have posed a significant problem." is an assumption worth checking... – Neville Kuyt Feb 03 '23 at 11:36
  • Thank you for your response. My main goal is to optimize the configuration file for MySQL, my.cnf. As I mentioned earlier, I have tried using MySQL Tuner, but it did not result in any performance improvement. It is possible that I may have missed a cache setting or some other relevant configuration. – iristhennars Feb 03 '23 at 11:51
  • cyberpanel has default , 10 PHP concurrency , you may need to raise that as well. – qtwrk Feb 03 '23 at 12:18
  • Thank you for pointing that out. I tried adjusting the value to 100-1000-10000, but unfortunately, there was no improvement. – iristhennars Feb 03 '23 at 12:37
  • I encourage you to delete MySQL Tuner and forget it ever existed. It frequently gives wrong advice. See some of my past answers about it for details: https://stackoverflow.com/search?q=user%3A20860+mysqltuner – Bill Karwin Feb 03 '23 at 15:26
  • 1
    There's no way to give specific advice to your question. I would guess your SQL queries are causing lots of I/O, so the CPU is just waiting on that. You may need to optimize queries by creating indexes, and you may need to increase `innodb_buffer_pool_size` if your data is much larger than the default buffer pool. The default size of the BP is quite undersized for a production site. But this is just an educated guess. You need to identify which queries are your bottleneck. – Bill Karwin Feb 03 '23 at 15:34
  • Additional DB information request, please. Run the test for 250 users - 10 m RAM size, # cores, any SSD or NVME devices on MySQL Host server? Post TEXT data on justpaste.it and share the links. From your SSH login root, Text results of: A) SELECT COUNT(*), sum(data_length), sum(index_length), sum(data_free) FROM information_schema.tables; B) SHOW GLOBAL STATUS; after 250 users - 10 minutes. C) SHOW GLOBAL VARIABLES; D) SHOW FULL PROCESSLIST; E) STATUS; not SHOW STATUS, just STATUS; G) SHOW ENGINE INNODB STATUS; for server workload tuning analysis to provide suggestions. – Wilson Hauck Feb 03 '23 at 20:51
  • Post TEXT data on justpaste.it and share the links. Additional very helpful OS information includes - please, htop 1st page, if available, TERMINATE, top -b -n 1 for most active apps, top -b -n 1 -H for details on your mysql threads memory and cpu usage, ulimit -a for list of limits, iostat -xm 5 3 for IOPS by device & core/cpu count, df -h for Used - Free space by device, df -i for inode info by device, free -h for Used - Free Mem: and Swap:, cat /proc/meminfo includes VMallocUused, for server workload tuning analysis to provide suggestions. – Wilson Hauck Feb 03 '23 at 20:52
  • 1
    @WilsonHauck current config: https://justpaste.it/7nnkm | server runs on ssd: https://justpaste.it/88r74 | A) https://justpaste.it/5ef5n | B) https://justpaste.it/ddml3 | C) https://justpaste.it/36069 | D) https://justpaste.it/c3d5j | E) https://justpaste.it/2ir3d | G) https://justpaste.it/a6r4l – iristhennars Feb 03 '23 at 21:18
  • 1
    @WilsonHauck top -b -n 1 -H https://pastebin.com/pZdgyC82 | ulimit -a https://pastebin.com/8R3WJV3A | df -h https://pastebin.com/t5AaX2TN | df -i https://pastebin.com/Gz21PqUk | free -h https://pastebin.com/JeTYCsWC | cat /proc/meminfo https://pastebin.com/SbB3wCsj – iristhennars Feb 03 '23 at 21:27
  • @iristhennars Thanks for posting your data. workload analysis in process. I hope to have Answer with suggestions posted before Monday AM. – Wilson Hauck Feb 04 '23 at 01:50
  • 1
    @WilsonHauck I just migrated the database used for the queries to CloudSQL. As a result, the page I used for the stress test is currently not connected to my own database. Despite this, there are still significant fluctuations in performance. Is this normal behavior? I am beginning to suspect that the issue may not be related to my database, but rather to some other aspect of the server. (result: https://i.imgur.com/Yl5BxqV.png | top processes during the test: https://pastebin.com/YFZcjCyq) – iristhennars Feb 04 '23 at 12:03
  • Those look like blazemeter screenshots, what does '20,000 users per second' mean? Is that 20k concurrent users, or 20k requests per second? Is it a test with 20k users doing exactly the same thing, what are these requests doing? I suspect you're looking at this the wrong way, focussing on how to optimize without addressing how the app works. E.g. is that 20k users concurrently asking for exactly the same thing from the DB? Irrespective, the question is too broad :). – AD7six Feb 05 '23 at 21:32
  • It needs tuning. There's no way to know. Connection pooling, indexes, tweaking the db config, rewriting queries. Are you getting deadlocks? Using INNODB tables? What kind of table locking, etc... – Garr Godfrey Feb 05 '23 at 21:33
  • Is there a reason we do NOT see a daemon in the TOP report of 3/4/2023? Normally we see a mysqld for the daemon processing client requests. – Wilson Hauck Feb 05 '23 at 22:17

2 Answers2

1

Suggestions to consider for your CloudSQL configuration

innodb_buffer_pool_size=8G  #  from ~ 192G because current data is less than 1G
innodb_io_capacity=500  #  from 200 to utilize more of your SSD IOPS
innodb_lru_scan_depth=100  # from 1024 to conserve 90% CPU cycles used every second for function
key_buffer_size=20M  # from ~ 128M needed for tmp tbl management, NO MyISAM tbls
sql_log_bin=0  # from ON unless you have a need for this specific log

Please view profile for contact info. Other performance enhancements available.

Wilson Hauck
  • 2,094
  • 1
  • 11
  • 19
  • 1
    Hello Wilson! Thank you for your response. I have reinstalled the operating system and CyberPanel, and it appears that the problem has been resolved. Although I am uncertain about what went wrong previously, I suspect that an incorrect setting was responsible. Upon the new installation, I implemented your suggestions and the performance appears to have improved. Thank you for your help! – iristhennars Feb 06 '23 at 13:22
  • View profile for contact info, please. – Wilson Hauck Feb 22 '23 at 11:29
0

For 20K users per second, you need multiple servers and switches in front of them. Period. End of discussion.

Well, OK, I'll discuss it further.

When MySQL is presented with lots of 'simultaneous' users, it plays fair with them -- each is given equal access to all resources. This is fine until it falls off the cliff. This is when most of the processing is dealing with sharing of the resources. All the threads will eventually finish, but each will take a long time and you (the DBA) will think it crashed and pull the plug.

A simple cure is to lower (YES, lower) the value of max_connections. It turns out that the "cliff" is at a few dozen connections.

Is benchmarking, one throws as much stuff at the server until it croaks. That's usually a few dozen.

In real life web pages are not doing 100% database operations, they are letting the user react, building pages, etc. So, a max_connections of a few hundred is realistic.

Once it reaches the cliff, latency goes through the roof. You would expect throughput to increase, too, but it decreases slightly. I believe that this is because the threads are stumbling over each other too much. Think about any "cache" (buffer_pool, open_tables, table_definitions, etc) -- if "too many" threads are running, the caches may become ineffective.

Think about a market with so many shoppers that they spend most of their time juggling around other people. More shoppers per hour can get through the market if they keep shoppers from entering when it is "full". max_connections is that limiter.

INDEXes needed

ALTER TABLE table_name ADD INDEX(zone);
ALTER TABLE table_name ADD INDEX(IPPool);

(Then take a crash course in the benefits of Indexes (aka "KEYs").)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    `[mysqld] innodb_buffer_pool_size=184320M innodb_log_buffer_size=16M innodb_flush_log_at_trx_commit=0 thread_cache=64 query_cache_size=0 join_buffer_size=512K table_cache=100000` This doesn't seem to fix the issue, but this my.cnf seems to work best for now. I have absolutely no experience with database optimization, so the configuration might be silly. Any recommendations? I also tried to lower the number of users per second and it seems that the first spike is after 54 users per second. Interestingly, this value is almost constant. – iristhennars Feb 03 '23 at 20:51
  • How many _queries_ per second? Are they fast queries? Does a "user" do more than one query before disconnecting? (I need a better feel for the processing to advise you further.) Benchmarks I have seen indicate that 54 users continually pounding on the db _may_ be enough to bring MySQL to its knees. OTOH, 54 users doing one trivial query, then exiting -- well that is not much burden. – Rick James Feb 03 '23 at 23:40
  • 1
    Each user starts a MySQL connection and makes 2 queries. They are really specific queries. The first gets the value if the zone is under maintenance and the second gets the IP Pool to check if the user's IP is whitelisted. So only 2 queries in total. – iristhennars Feb 04 '23 at 11:20
  • 1
    Hold on...I just migrated the database used for the queries to CloudSQL. As a result, the page I used for the stress test is currently not connected to my own database. Despite this, there are still significant fluctuations in performance. Is this normal behavior? I am beginning to suspect that the issue may not be related to my database, but rather to some other aspect of the server. (result: https://i.imgur.com/Yl5BxqV.png | top processes during the test: https://pastebin.com/YFZcjCyq) – iristhennars Feb 04 '23 at 12:02
  • Does the stress tester use those actual queries against your actually db? Are the "point" queries -- That is a look up by the PRIMARY KEY or at least a single row based on a secondary INDEX? Looks like a lower bound of 100ms -- is the client on the other side of the world from the server? Or are the queries complex? – Rick James Feb 04 '23 at 16:31
  • How many milliseconds between connecting to disconnecting? Any form of "connection pooling"? – Rick James Feb 04 '23 at 16:33
  • 1
    Test settings: https://i.imgur.com/6HPp0uP.png | Queries are just these two: "SELECT FROM maintenance WHERE zone='$zone'" and "SELECT FROM maintenance WHERE zone='IPPool'". There is no such data about where the clients are but I assume the server is based in the EU because the response time in the beginning is really low. – iristhennars Feb 04 '23 at 16:49
  • And `zone` is indexed? – Rick James Feb 04 '23 at 19:10
  • Does each client need to hang on for 1 minute? Seems like it would/should go away after milliseconds. – Rick James Feb 04 '23 at 19:12
  • 1
    This is what the database looks like: https://i.imgur.com/fPVQVXS.png with the test, yes every client hangs on for 1 minute on the website. By the way, thank you for your interest in the problem. What do you think about the thing that I wrote about CloudSQL? Why do you think there are still fluctuations in the average response time even when the database is somewhere else? – iristhennars Feb 04 '23 at 21:06
  • 1
    Any Cloud service is likely to have other users on the same hardware. Therefore, when they are busy _and_ you are busy, slowdowns can occur. – Rick James Feb 05 '23 at 03:12
  • Please post TEXT results of A) SHOW CREATE TABLE maintenance; AND B) SHOW TABLE STATUS WHERE name LIKE "maintenance"; for analysis (looking for existing indexes) and cardinality information. – Wilson Hauck Feb 05 '23 at 21:12