We have a managed instance group (MIG) that contains a single GCE instance (our current application does not support autoscaling). The autohealing health check is a TCP check on port 22 with a 30-second interval, and the instance is treated as unhealthy after 3 consecutive failed checks.
Recently we have hit several incidents, with no pattern in their timing, in which this health check fails and the instance is recreated.
One indication we have is that high CPU utilization (~99%) was observed on the VM before the instance was recreated, as in the alert below:
*******************************************
{
  insertId: "vwsb19f4r1p5f"
  labels: {
    activity_type_name: "ViolationOpenEventv1"
    policy_id: "17302346025290670292"
    resource_name: "gcp-project-name gcp-gce-name"
    started_at: "1674896821"
    terse_message: "CPU utilization for gcp-project-name gcp-gce-name with metric labels {instance_name=gcp-gce-name} **is above the threshold of 0.850 with a value of 0.989**."
    verbose_message: "CPU utilization for gcp-project-name gcp-gce-name with metric labels {instance_name=gcp-gce-name} **is above the threshold of 0.850 with a value of 0.989**."
    violation_id: "0.mt6pqza8uq6c"
  }
  logName: "projects/gcp-project-name/logs/monitoring.googleapis.com%2FViolationOpenEventv1"
  receiveTimestamp: "2023-01-28T09:07:01.311681568Z"
  resource: {2}
  timestamp: "2023-01-28T09:07:01Z"
}
******************************************
After that, no logs are available in the Stackdriver Logs Explorer for this GCE instance, and a few minutes (10 to 15) later the recreation process starts with the message below.
*******************************************
{
"protoPayload": {
"@type": "type.googleapis.com/google.cloud.audit.AuditLog",
"status": {
"message": "Instance Group Manager 'projects/1048357249635/zones/europe-north1-b/instanceGroupManagers/gcp-project-name-managed-instance-group' initiated recreateInstance on instance 'projects/1048357249635/zones/europe-north1-b/instances/gcp-project-name-bbjn'. **Reason: Instance eligible for autohealing: instance unhealthy.**"
},
"authenticationInfo": {
"principalEmail": "system@google.com"
},
"serviceName": "compute.googleapis.com",
"methodName": "compute.instances.repair.recreateInstance",
"resourceName": "projects/gcp-project-name/zones/europe-north1-b/instances/gcp-project-name-bbjn",
"request": {
"@type": "type.googleapis.com/**compute.instances.repair.recreateInstance**"
}
******************************************
We have tried several options:
1) Upgraded the machine type from n1-highcpu-4 to n1-highcpu-8.
2) Increased the health check interval to 1 minute and the number of consecutive failures required to 5.
However, neither change has brought any improvement.
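For reference, a rough estimate of how long the probes on port 22 must keep failing before the instance is marked unhealthy is the check interval multiplied by the number of consecutive failures required. The sketch below only illustrates that arithmetic for the two policies mentioned above; the function name is made up for this post, and the estimate ignores the per-probe timeout and any scheduling jitter.

```python
def seconds_until_unhealthy(check_interval_sec: int, unhealthy_threshold: int) -> int:
    """Rough estimate of how long probes must keep failing before the
    instance is marked unhealthy (ignores probe timeout and jitter)."""
    return check_interval_sec * unhealthy_threshold


# Original policy: 30 s interval, 3 consecutive failures.
original_window = seconds_until_unhealthy(30, 3)
# Relaxed policy: 1 min interval, 5 consecutive failures.
relaxed_window = seconds_until_unhealthy(60, 5)

print(original_window, relaxed_window)  # 90 300
```

Both windows are shorter than the 10 to 15 minute gap seen in the logs, so relaxing the policy alone would not be expected to prevent recreation if the VM stays unresponsive for that long.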
**************************************
Our observations so far:
1. Before the issue arises, the machine becomes unresponsive and does not accept any new connections, such as an IAP tunnel or SSH from the console.
2. Because of this, we are not able to monitor the machine or figure out the exact cause.
3. Once, by chance, I was already logged in and could see that many processes like the one below were consuming most of the CPU. I believe these are child processes of our ETL job, which executes BQ MERGE SQL statements.
**/usr/lib64/google-cloud-sdk/platform/bundledpythonunix/bin/python3 /usr/lib64/google-cloud-sdk/bin/bootstrapping/bq.py query --use_legacy_sql=false**
4. The number of queries in flight and the slot utilization also dropped significantly during that time.
This also suggests that traffic from that machine was not reaching the BQ API and that jobs were piling up.
5. This GCE instance is hosted in a private VPC in the same project.
6. We have scheduled several custom in-house monitoring jobs that capture data such as the top 20 CPU- and memory-consuming processes and rsync it to a GCS bucket. Sadly, when the issue occurs those jobs seem to get stuck as well, since they do not send any data.
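To make point 6 concrete, here is a minimal sketch of the kind of "top 20 CPU-consuming processes" capture described above. It is not our actual in-house job: the names `ProcSample`, `parse_ps_output` and `capture_top_processes` are made up for illustration, and it assumes a Linux host where `ps` from procps is available.

```python
import subprocess
from typing import List, NamedTuple


class ProcSample(NamedTuple):
    pid: int
    cpu_percent: float
    mem_percent: float
    command: str


def parse_ps_output(ps_output: str, limit: int = 20) -> List[ProcSample]:
    """Parse `ps -eo pid,pcpu,pmem,args` text and return the top CPU consumers."""
    samples = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # the command itself may contain spaces
        if len(fields) < 4:
            continue  # ignore malformed or truncated lines
        pid, cpu, mem, command = fields
        samples.append(ProcSample(int(pid), float(cpu), float(mem), command.strip()))
    # Highest CPU usage first, then keep only the requested number of rows.
    samples.sort(key=lambda s: s.cpu_percent, reverse=True)
    return samples[:limit]


def capture_top_processes(limit: int = 20) -> List[ProcSample]:
    """Run ps on the local host and return the top CPU consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,args"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_output(result.stdout, limit)
```

Calling `capture_top_processes()` on the VM returns up to 20 samples sorted by CPU usage. Keeping the parsing separate from the `ps` call makes the collector easy to test offline; whether such a job would keep running while the VM is saturated is not something this sketch addresses.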
*****************
If anyone has faced a similar issue or has any ideas, please share; it would be very helpful for us.