
I have a WordPress/WooCommerce site that I migrated to a new environment. I upgraded my EC2 instance from a legacy m4.10xlarge to a new m5.8xlarge. The major differences: the old machine ran Amazon Linux 1 with PHP 7.2, while the new machine runs Amazon Linux 2 with PHP 7.4. I also made a copy of my database on Amazon RDS, upgrading it from MySQL 5.6 to 5.7, and the instance sits behind a load balancer, which I changed from a Classic Load Balancer to an Application Load Balancer.

The new environment works, except that the load balancer now shows a very high connection count and the RDS instance shows spiking database connections. The DB connections will sit around 10 and then randomly spike to 200-400 before dropping back down. While this is happening the site runs extremely slowly and certain pages sometimes return a 504 Gateway Timeout.
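For correlating the two symptoms, these CloudWatch queries pull the RDS connection count and the ALB active connection count over the same window (a sketch; the DB identifier, load balancer name, and time range are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
  --start-time 2021-06-01T00:00:00Z --end-time 2021-06-01T06:00:00Z \
  --period 60 --statistics Maximum

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB --metric-name ActiveConnectionCount \
  --dimensions Name=LoadBalancer,Value=app/my-alb/0123456789abcdef \
  --start-time 2021-06-01T00:00:00Z --end-time 2021-06-01T06:00:00Z \
  --period 60 --statistics Sum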


This behavior definitely did not exist in my old environment, where the RDS DB connections usually hovered around 20 on average, and I have gone through a lot of steps trying to resolve it. I have spent many hours on the phone with Amazon technical support, but they just pass me between different teams and it goes in circles with no result.

I have tried tweaking /etc/httpd/conf/httpd.conf, setting values that I read about or that were suggested to me, such as:

KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 500
TimeOut 300
AcceptFilter http none
AcceptFilter https none 
<IfModule mpm_prefork_module>
      StartServers           300
      MinSpareServers        50
      MaxSpareServers        100
      ServerLimit            1000
      MaxRequestWorkers      1000
      MaxConnectionsPerChild 10000
</IfModule>

I have tried tweaking these but to no avail; the connections still spike. I have also tried setting values in my RDS parameter group to limit connections, such as wait_timeout = 10, interactive_timeout = 60, and net_write_timeout = 60.
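To confirm whether those parameters actually took effect, and to see what the spiking connections are doing, something like this against the RDS endpoint helps (the hostname and user are placeholders):

mysql -h my-db.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p -e "
  SHOW VARIABLES LIKE '%timeout%';
  SHOW GLOBAL STATUS LIKE 'Threads_connected';
  SHOW GLOBAL STATUS LIKE 'Max_used_connections';
  SHOW FULL PROCESSLIST;"

The processlist should at least show whether the pile-up is many short requests each holding a connection or a handful of long-running queries.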

I even tried switching from the prefork MPM to the event MPM with PHP-FPM and FastCGI. Whenever I ran that setup my pages would rarely work and returned a 504 Gateway Timeout about 50% of the time, so I reverted to the prefork module.
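For context, the kind of event/PHP-FPM setup I was attempting looks roughly like this (a sketch, not my exact config; it assumes PHP-FPM is listening on a Unix socket at /run/php-fpm/www.sock and that mod_proxy and mod_proxy_fcgi are loaded):

<IfModule mpm_event_module>
    StartServers            3
    MinSpareThreads         25
    MaxSpareThreads         75
    ThreadsPerChild         25
    MaxRequestWorkers       400
    MaxConnectionsPerChild  0
</IfModule>

# Hand .php requests to PHP-FPM over the Unix socket
<FilesMatch \.php$>
    SetHandler "proxy:unix:/run/php-fpm/www.sock|fcgi://localhost"
</FilesMatch>

# One common cause of 504s under event/FPM is the proxy timing out before PHP
# finishes, so ProxyTimeout should be at least PHP's max_execution_time
ProxyTimeout 300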

The last group of settings I tried tuning was a set of TCP/network values in /etc/sysctl.conf:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_fin_timeout = 30

# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1

# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 60
net.ipv4.tcp_keepalive_intvl = 20

#Increase TCP max buffer size settable using setsockopt():
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432

#Increase Linux autotuning TCP buffer limits (min, default, and max number of bytes); set max to 16MB for 1GE, and 32M or 54M for 10GE:
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432

#How much packet processing can be done in one polling cycle across all NAPI structures registered to a CPU
net.core.netdev_budget = 600

#Increase the incoming connections backlog queue; this sets the maximum number of packets queued on the INPUT side
net.core.netdev_max_backlog = 3000000

#Increase the limit of the socket listen() backlog; also the maximum value that net.ipv4.tcp_max_syn_backlog can take
net.core.somaxconn = 1000000
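After editing /etc/sysctl.conf the values need to be reloaded, and it is worth confirming a few of them actually applied:

# Reload /etc/sysctl.conf and spot-check some of the values
sudo sysctl -p
sysctl net.core.somaxconn net.ipv4.tcp_fin_timeout net.core.netdev_max_backlog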

Nothing I have tried has reduced the spikes in database connections or the high connection count on the load balancer. When I run top on the server I sometimes see a perfectly normal, low load average, and then it shoots up far beyond the 32 cores of the new machine; I've seen the load average hit 150 before dropping back down.

None of my tweaks or tuning has produced anything I can notice in netstat or top. The results are still the same and the behavior never changes.
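When a spike hits, a quick snapshot like this (wrapped in watch -n 2 to follow it live) at least shows whether the number of Apache workers and the established MySQL connections climb together; port 3306 is assumed for MySQL:

# Number of httpd worker processes, established connections to MySQL (3306), and load average
ps -C httpd --no-headers | wc -l
ss -tn state established '( dport = :3306 )' | tail -n +2 | wc -l
uptime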

If anyone has any idea of what I could try or look into next, or any advice at all, it would be greatly appreciated.

Nick

1 Answer


How often does that load balancer check the activity on each server? What is the average response time for your app? If the former is longer than the latter, then the balancer is causing the problem.

How many servers are you balancing among?

Beg for "round robin", not something "smart". It is better for low-latency apps.

If, on the other hand, your app is taking 10 seconds or more for any query, then you need to pursue that.
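One simple way to catch those on RDS is the slow query log, which can be switched on through the DB parameter group (the group name here is a placeholder):

# Log anything slower than 2 seconds; all three parameters are dynamic
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-mysql57-params \
  --parameters "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
               "ParameterName=long_query_time,ParameterValue=2,ApplyMethod=immediate" \
               "ParameterName=log_output,ParameterValue=FILE,ApplyMethod=immediate"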

Rick James
  • I definitely think you are correct. I started messing with my load balancer and things have actually been changing for the first time since I started this journey. – Nick Jun 03 '21 at 01:35