
My setup is a Bitnami WordPress instance hosted on a GCP n2-standard-2 VM. I'm using an HTTPS load balancer and CDN.

I've encountered 502 errors a few times since I configured the load balancer. I was running quite a few SEO and page-scanning tests when this happened.

I've checked that the VM is only using 8-12% of its disk capacity, and the log shows a maximum CPU usage of 9.62%. I have to restart the VM to resolve the error.

What could be causing the 502 errors?

  • Could it be due to a traffic spike from the third-party scanning sites?
  • Is it because of my health check configuration?
  • Do I have to change the machine type and increase the memory?

What should I look into to troubleshoot it?
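
One thing I plan to try is recording CPU and memory continuously on the VM so that I have numbers from the exact moment the site stops responding. Below is a minimal sketch of what I have in mind, assuming Python 3 and the third-party psutil package are available on the instance; the log path and interval are placeholders, not anything from my current setup.

```python
# resource_log.py - periodically record CPU, memory and swap usage so the
# state of the VM just before an outage can be reviewed afterwards.
# Assumption: Python 3 with `pip install psutil` on the instance.
import datetime
import time

import psutil

LOG_FILE = "/tmp/resource_log.csv"   # hypothetical path, change as needed
INTERVAL_SECONDS = 10

with open(LOG_FILE, "a") as f:
    f.write("timestamp,cpu_percent,mem_percent,swap_percent\n")
    while True:
        cpu = psutil.cpu_percent(interval=1)    # CPU usage sampled over 1 second
        mem = psutil.virtual_memory().percent   # RAM in use
        swap = psutil.swap_memory().percent     # swap in use (high = memory pressure)
        f.write(f"{datetime.datetime.now().isoformat()},{cpu},{mem},{swap}\n")
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```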

This is my health check setup:

[screenshot of the health check configuration]


The server went down again, and this time round I managed to find the information you suggested.

  1. The error is not from the load balancer.
  2. The error is from the VM, and the error message is: "Error watching metadata: Get http://169.254.169.254/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=ag92d16ff423b06: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" (see the metadata check sketch after this list).
  3. The VM disk size is 100 GB and the machine type is n2-standard-2.
  4. It is a WordPress instance.
  5. Everything is within quota.
  6. Incidents happen on a few occasions:
    • When I use a third-party site to scan the website for dead links. Shortly after the scan completes, the server goes down, and I have to reboot the instance to make it functional again.
    • It also happens randomly and recovers by itself after a while.
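
For reference, the metadata endpoint mentioned in that error can be probed directly from the VM with nothing but the Python standard library; the only thing the documented API requires is the `Metadata-Flavor: Google` header. This is only a sketch, but if it also times out while an outage is in progress, that would point at the VM itself (network stack or resource starvation) rather than at the metadata service.

```python
# metadata_check.py - verify the GCE metadata server is reachable from the VM.
# Uses only the Python standard library.
import urllib.request

URL = "http://169.254.169.254/computeMetadata/v1/instance/hostname"
req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("metadata server OK:", resp.read().decode())
except Exception as exc:
    # A timeout here during an incident suggests the VM is starved,
    # not that the metadata service is down.
    print("metadata server unreachable:", exc)
```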

Thanks everyone for your help. I've just managed to figure out how to retrieve the other required info.

I was wrong earlier: the load balancer did report errors.

Below is from Logging

  1. From the load balancer: Client disconnected before any response
  2. From the load balancer: 502 - failed_to_pick_backend (see the backend probe sketch after this list)
  3. From the unmanaged instance group: Timeout waiting for data, and HTTP response Internal server error
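
From the troubleshooting docs, `failed_to_pick_backend` means the load balancer could not find a healthy backend to send the request to, i.e. the instance was failing its health check at that moment. As a rough check, the same kind of request the health check makes can be issued against the instance directly and timed. The sketch below is not my actual configuration: the internal IP, port and path are assumptions to be replaced with whatever the backend service and health check actually use.

```python
# probe_backend.py - issue the same kind of request the health check makes and
# time the response. Run from another machine in the same VPC or over SSH.
# ASSUMPTIONS: health check targets HTTP on port 80 at "/"; adjust to match
# the real backend service / health check configuration.
import time
import urllib.request

TARGET = "http://10.128.0.2/"   # hypothetical internal IP of the instance
TIMEOUT = 5                      # seconds, roughly a health check timeout

start = time.monotonic()
try:
    with urllib.request.urlopen(TARGET, timeout=TIMEOUT) as resp:
        elapsed = time.monotonic() - start
        print(f"HTTP {resp.status} in {elapsed:.2f}s")
except Exception as exc:
    elapsed = time.monotonic() - start
    # If this times out or returns a 5xx, the health check will fail too and
    # the load balancer will report failed_to_pick_backend.
    print(f"probe failed after {elapsed:.2f}s: {exc}")
```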

I tried increasing the load balancer timeout duration, but the VM still shut down and rebooted on its own. Sometimes it takes a few minutes to recover and sometimes about an hour or more.
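
Since the pattern (random stalls that recover on their own, while disk and CPU usage both look low) could point at memory pressure rather than CPU, it also seems worth checking whether the kernel OOM killer fired around the incidents. A small sketch, assuming a Debian-based Bitnami image where kernel messages end up in /var/log/syslog or /var/log/kern.log:

```python
# oom_check.py - scan kernel/system logs for OOM-killer activity.
# Assumes a Debian-style log layout (/var/log/syslog, /var/log/kern.log);
# run with sudo so the log files are readable.
import glob

PATTERNS = ("Out of memory", "oom-kill", "Killed process")

for path in sorted(glob.glob("/var/log/syslog*") + glob.glob("/var/log/kern.log*")):
    if path.endswith(".gz"):
        continue  # skip rotated, compressed logs in this simple sketch
    try:
        with open(path, errors="replace") as f:
            for line in f:
                if any(p in line for p in PATTERNS):
                    print(f"{path}: {line.rstrip()}")
    except PermissionError:
        print(f"cannot read {path}, run with sudo")
```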

I've provided some screenshots which record the most recent incident, from 8:47 to 8:54.

Below is from Monitoring

[three monitoring screenshots covering the incident window]
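
For completeness, the same data as in the screenshots can be pulled programmatically with the Cloud Monitoring API. A sketch using the google-cloud-monitoring client (the project ID is a placeholder, and memory metrics only exist if the monitoring/Ops Agent is installed on the VM):

```python
# query_cpu.py - pull the last hour of CPU utilisation for the project's VMs.
# Requires: pip install google-cloud-monitoring, plus application default credentials.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project-id"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    instance = ts.resource.labels.get("instance_id", "unknown")
    for point in ts.points:
        # cpu/utilization is reported as a fraction between 0 and 1
        print(instance, point.interval.end_time, f"{point.value.double_value:.2%}")
```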

jollysea

  • Who is reporting the 502 error, the load balancer, or the backend VM? What is the request that caused the 502 error? What is the VM size and configuration? Are quotas being exceeded for disk IOPS, etc.? Details are required to know, otherwise we can only guess. – John Hanley Feb 09 '22 at 18:21
  • Also check what the "statusDetail" field says for the 502 errors. It might give a clue on what the issue is. You can find more info here: https://cloud.google.com/load-balancing/docs/https/troubleshooting-ext-https-lbs#unexplained_502_errors – Erhard Czving Feb 10 '22 at 12:11
  • Please enable healthcheck logs, lower 'healthy threshold' to 2 checks and post results. This will help narrow down the issue. – Sergiusz Feb 15 '22 at 11:33
  • I do not understand why a failure to read the VM metadata would result in a 502 unless the CPU or network was maxed out at that instant. I do not think you have found the problem. – John Hanley Feb 20 '22 at 16:26
  • Hi @JohnHanley That is the only error reported in Logging; I didn't see any other errors. I checked that CPU usage is less than 20% after the site is functional again. I don't know how to check the CPU usage when the server goes down. How or where can I check that? – jollysea Feb 20 '22 at 22:52
  • Please define what "the server goes down" means. Your server will have logs; review them to find the source of the problem. The Google metadata server is not the problem but may be a symptom caused by the real problem. – John Hanley Feb 20 '22 at 23:57
  • Thanks @Sergiusz, I enabled the health check logs and lowered the healthy threshold to 2 as suggested. – jollysea Feb 24 '22 at 01:40
  • Thanks @ErhardCzving. I just managed to figure out where to look for the statusDetail. It is `failed_to_pick_backend`. – jollysea Feb 24 '22 at 01:41
  • Thanks @JohnHanley, I most probably used an inappropriate term. By "server goes down" I meant the website is not accessible. – jollysea Feb 24 '22 at 01:44
  • Google Cloud virtual machines do not shut down and reboot on their own. That reboot process will be logged. The only item I see is that CPU utilization spiked to an abnormally high value which can cause thrashing. Try a larger instance size (2x) and repeat the test that caused the problem. You can resize the instance size back once the test runs. – John Hanley Feb 24 '22 at 01:56
  • Thanks @JohnHanley I just realised I have the automatic restart option checked, hence the auto-restart after failure. I tested with a larger instance as you suggested. With a 1x larger instance, the server still shut down, but not as often, and the recovery time was faster. With the 2x larger instance you recommended, after a week plus of monitoring there haven't been any incidents of downtime. Without trial and error, is there a way to gauge what size is best suited to our needs? – jollysea Mar 06 '22 at 06:08
  • Use monitoring and watch CPU, memory and DISK IOPS. – John Hanley Mar 06 '22 at 06:31

0 Answers