
We have a MIG with only one GCE instance (our current application does not support autoscaling). As part of the health check policy we do a TCP check on port 22 with a 30-second interval; if 3 consecutive checks fail, the instance is treated as unhealthy.
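For reference, a TCP health check with those parameters can be created roughly as below (the health-check name `ssh-tcp-hc` is a placeholder, not our actual one):

```shell
# Sketch of the current health check configuration:
# TCP probe on port 22, every 30 s, unhealthy after 3 consecutive failures.
gcloud compute health-checks create tcp ssh-tcp-hc \
    --port=22 \
    --check-interval=30s \
    --timeout=10s \
    --unhealthy-threshold=3 \
    --healthy-threshold=2
```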

We have recently faced multiple situations, with no pattern in timing, where this health check fails and the instance gets recreated.

One indication we have is that, before the instance recreation, high CPU utilization (~99%) was observed on the VM, as in the log entry below:

*******************************************
{
  "insertId": "vwsb19f4r1p5f",
  "labels": {
    "activity_type_name": "ViolationOpenEventv1",
    "policy_id": "17302346025290670292",
    "resource_name": "gcp-project-name gcp-gce-name",
    "started_at": "1674896821",
    "terse_message": "CPU utilization for gcp-project-name gcp-gce-name with metric labels {instance_name=gcp-gce-name} **is above the threshold of 0.850 with a value of 0.989**.",
    "verbose_message": "CPU utilization for gcp-project-name gcp-gce-name with metric labels {instance_name=gcp-gce-name} **is above the threshold of 0.850 with a value of 0.989**.",
    "violation_id": "0.mt6pqza8uq6c"
  },
  "logName": "projects/gcp-project-name/logs/monitoring.googleapis.com%2FViolationOpenEventv1",
  "receiveTimestamp": "2023-01-28T09:07:01.311681568Z",
  "resource": {2},
  "timestamp": "2023-01-28T09:07:01Z"
}
******************************************

After that, no logs are available in the Stackdriver Logs Explorer for this GCE, and a few minutes (10-15) later the recreation process starts with the message below.

*******************************************

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Instance Group Manager 'projects/1048357249635/zones/europe-north1-b/instanceGroupManagers/gcp-project-name-managed-instance-group' initiated recreateInstance on instance 'projects/1048357249635/zones/europe-north1-b/instances/gcp-project-name-bbjn'. **Reason: Instance eligible for autohealing: instance unhealthy.**"
    },
    "authenticationInfo": {
      "principalEmail": "system@google.com"
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "compute.instances.repair.recreateInstance",
    "resourceName": "projects/gcp-project-name/zones/europe-north1-b/instances/gcp-project-name-bbjn",
    "request": {
      "@type": "type.googleapis.com/**compute.instances.repair.recreateInstance**"
    }

******************************************

We have tried multiple options:
1) Increased the machine type from n1-highcpu-4 to n1-highcpu-8.
2) Increased the health check interval and failure threshold to 1 minute and 5 consecutive checks respectively.

But even so, no improvement has been observed.
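The relaxed policy in (2) was applied along these lines (again, `ssh-tcp-hc` is a placeholder name):

```shell
# Relax the probe: 60 s interval, 5 consecutive failures before unhealthy.
gcloud compute health-checks update tcp ssh-tcp-hc \
    --check-interval=60s \
    --unhealthy-threshold=5
```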
**************************************
Our observations:
1. Before the issue arises, the machine becomes unresponsive and does not allow any new connection, whether via IAP tunnel or SSH from the console.
2. For that reason we are not able to monitor the VM or figure out the exact cause.
3. Once I happened to be logged in and could see that many processes like the one below were consuming most of the CPU; I believe these are child processes of our ETL job, which executes BigQuery MERGE SQL.

**/usr/lib64/google-cloud-sdk/platform/bundledpythonunix/bin/python3 /usr/lib64/google-cloud-sdk/bin/bootstrapping/bq.py query --use_legacy_sql=false**

4. The number of queries in flight and the slot utilization also became significantly low during that time, which means traffic from that machine was not reaching the BigQuery API and jobs piled up.
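For what it's worth, those piled-up jobs can be inspected from another machine with the bq CLI (`JOB_ID` below is a placeholder):

```shell
# List the most recent BigQuery jobs, including pending/running ones,
# to see whether submissions from the VM actually reached the API.
bq ls -j --max_results=20

# Show details (state, start time, errors) for one suspicious job.
bq show -j JOB_ID
```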

5. This GCE is hosted in a private VPC in the same project.
6. We have scheduled many custom/in-house monitoring processes to capture data such as the top 20 CPU- and memory-consuming processes and rsync it to a GCS bucket, but sadly, when the issue occurs those jobs also get stuck, I believe, as they do not send any data.
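One way to make such a collector more robust is to wrap every step in a hard timeout, so it either produces a snapshot or fails fast instead of hanging with the rest of the VM. A minimal sketch (the output paths and bucket name are hypothetical):

```shell
#!/bin/sh
# Snapshot the top CPU consumers with a hard timeout so the collector
# itself cannot hang when the VM is overloaded.
OUT=/tmp/cpu_snapshot.txt
timeout 10 ps aux --sort=-%cpu > "$OUT" 2>/dev/null \
    || echo "ps unavailable or timed out" > "$OUT"

# Keep only the header plus the top 20 processes.
head -21 "$OUT" > /tmp/cpu_snapshot.top20

# In the real setup this file would be shipped off-box, with its own timeout
# (bucket name is a placeholder):
# timeout 30 gsutil cp /tmp/cpu_snapshot.top20 gs://my-monitoring-bucket/ || true
```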
*****************
If anyone has faced the same kind of issue, or has any idea, please share; that would be very helpful for us.
Souvik Das
  • Can someone please give a hint about what the possible reason could be? – Souvik Das Feb 15 '23 at 07:53
  • Did you have time to check my answer? Did it help you solve your problem? If yes, please consider accepting and upvoting it. I am happy to help if you have any further queries. – Veera Nagireddy Feb 24 '23 at 08:11
  • Have you tried disabling the Health Check on the MIG temporarily, to make sure you can collect the data from monitoring (and e.g. dump it to the GCS bucket once the VM starts being responsive)? – Grzenio Feb 27 '23 at 09:52
  • Hello @Souvik Das, feel free to update the status of the question. Let me know if the answer below helps to resolve your issue. I am happy to help you if you have any further queries. – Veera Nagireddy Mar 28 '23 at 07:11

1 Answer


Try the possible solutions below:

  1. Try enabling autohealing to repair unhealthy instances; to learn how a MIG automatically repairs VMs, see About repairing VMs in a MIG. For a MIG whose instances don't have external IPs, you must allow access to your service from the health-check probe ranges, as per the official GCP Firewall rules for health checks documentation.
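For reference, the documented GCP health-check probe ranges are 130.211.0.0/22 and 35.191.0.0/16; an ingress rule along these lines (the rule and network names are placeholders) lets the TCP probe reach port 22:

```shell
# Allow the GCP health-check probe ranges to reach port 22 on the MIG VMs.
gcloud compute firewall-rules create allow-health-checks \
    --network=my-private-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=130.211.0.0/22,35.191.0.0/16
```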

  2. Check whether you have a default allow-SSH rule. If not, create a rule that allows SSH access from the IAP range; see the official GCP documentation on how to Create a firewall rule for more information.
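For reference, IAP TCP forwarding uses the documented range 35.235.240.0/20; a sketch of such a rule (the rule and network names are placeholders):

```shell
# Allow SSH ingress from Identity-Aware Proxy's TCP forwarding range.
gcloud compute firewall-rules create allow-ssh-from-iap \
    --network=my-private-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=35.235.240.0/20
```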

  3. Check whether your boot disk is full. If it is, there may be no disk space left for the keys that are automatically transferred at the beginning of the SSH session. See Troubleshooting SSH errors (if the VM's boot disk is full).

  4. Connecting via the serial console and cleaning up disk space may help resolve the issue. Use df -h to see overall consumption (including temp files) and du -h to find the largest files/folders.
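A concrete sketch of that triage, once connected over the serial console:

```shell
# Overall usage of the root filesystem.
df -h /

# Ten largest entries under /var, a common culprit when a boot disk fills up.
du -xh /var 2>/dev/null | sort -rh | head -10
```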

  5. As a last resort, create an image from the latest VM snapshot, then delete the VM and recreate it from that image.

Veera Nagireddy