
What would be the best way to set up a GCP monitoring alert policy for a Kubernetes CronJob failing? I haven't been able to find any good examples out there.

Right now, I have an OK solution based on monitoring logs in the Pod with ERROR severity. I've found this to be quite flaky, however. Sometimes a job will fail for some ephemeral reason outside my control (e.g., an external server returning a temporary 500) and on the next retry, the job runs successfully.

What I really need is an alert that is only triggered when a CronJob is in a persistent failed state. That is, Kubernetes has tried rerunning the whole thing, multiple times, and it's still failing. Ideally, it could also handle situations where the Pod wasn't able to come up either (e.g., downloading the image failed).

Any ideas here?

Thanks.

joeltine

2 Answers


First of all, confirm which GKE version you are running. The following commands will show your cluster's default version and the available versions for a given release channel (RAPID in this example):

Default version:

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.defaultVersion)"

Available versions:

gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
    --format="yaml(channels.channel,channels.validVersions)"

Now that you know your GKE version: since what you want is an alert that fires only when a CronJob is in a persistent failed state, GKE Workload Metrics used to be GCP's fully managed, highly configurable solution for sending all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or a Deployment) to Cloud Monitoring. However, it was deprecated in GKE 1.24 and replaced by Google Cloud Managed Service for Prometheus, so that is now the best option within GCP: it lets you monitor and alert on your workloads using Prometheus, without having to manually manage and operate Prometheus at scale.
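As a sketch of how that could look with Managed Prometheus: assuming kube-state-metrics is running in the cluster and exporting `kube_job_status_failed`, a managed collection `Rules` resource can alert once a Job has exhausted its retries and stays failed. All names, namespaces, and thresholds below are illustrative placeholders, not part of the original answer:

```yaml
# Hypothetical Rules resource for Google Cloud Managed Service for Prometheus.
# Assumes kube-state-metrics is installed and exporting kube_job_status_failed.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: cronjob-failed-alerts
  namespace: your-namespace        # adjust to your namespace
spec:
  groups:
    - name: cronjob-failures
      interval: 60s
      rules:
        - alert: CronJobPersistentlyFailed
          # kube_job_status_failed > 0 means the Job controller marked the Job
          # failed, i.e. all backoffLimit retries were exhausted.
          expr: kube_job_status_failed > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Job {{ $labels.job_name }} failed after exhausting retries"
```

Because the alert keys off the Job's terminal `Failed` state rather than Pod logs, transient errors that succeed on retry never fire it.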

Plus, you have two options outside of GCP: a self-managed Prometheus, and the Prometheus Pushgateway.

Finally, just FYI, this can also be done manually by querying for the Job, reading its start time, and comparing that to the current time. For example, with bash and jq:

START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq '.status.startTime')
echo $START_TIME
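To turn that timestamp into "how long has this Job been running", you can diff it against the current time. A minimal sketch with GNU date; the timestamp is hardcoded here for illustration, but in practice it would come from the kubectl | jq pipeline above (use `jq -r` so the quotes are stripped):

```shell
# Sketch: compute a Job's elapsed runtime from its .status.startTime.
# START_TIME is hardcoded for illustration; in practice:
#   START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq -r '.status.startTime')
START_TIME="2023-04-19T12:05:53Z"

start_epoch=$(date -u -d "$START_TIME" +%s)   # parse the RFC 3339 timestamp
now_epoch=$(date -u +%s)
elapsed=$(( now_epoch - start_epoch ))

echo "Job started ${elapsed}s ago"
```

If `elapsed` exceeds whatever deadline you consider reasonable for the Job, you can treat it as stuck and alert on that.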

Or you can get the Job's current status as a JSON blob, as follows:

kubectl -n=your-namespace get job your-job-name -o json | jq '.status'

You can see this related thread for more reference, too.

Taking the “Failed” state as the core of your requirement, setting up a bash script with kubectl that sends an email whenever it sees a Job in the “Failed” state can be useful. Here are some examples:

while true; do if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then mail -s jobfailed email@address; break; else sleep 1; fi; done

For newer Kubernetes versions, using kubectl wait:

while true; do kubectl wait --for=condition=Failed job/myjob && mail -s jobfailed email@address; done
    TY! Lots of good info here. One of these should meet my needs. – joeltine Mar 16 '22 at 21:34
  • I'm glad to read my answer was useful, my purpose here is to help. Success. – Nestor Daniel Ortega Perez Mar 17 '22 at 18:47
    Hmm, so I looked at this in more detail today. Managed Prometheus isn't quite what I need. Prometheus is good for exporting metrics from the workloads themselves, but what I need is something a little higher level. I need a high level alert that a CronJob has failed from the perspective of the Kubernetes scheduler. E.g., k8s may retry a cronjob multiple times and all might fail, putting it into a failed state. It's then and only then I need a single alert. – joeltine Mar 17 '22 at 20:53
  • Taking the “Failed” state as the core of your requirement, setting up a bash script with kubectl to send an email when you see a Job in “Failed” state can be useful. I edited the answer to add example code for that, including for newer K8s versions. On the other hand, do you want your alert only when the CronJob enters the “Failed” state after multiple k8s retries? – Nestor Daniel Ortega Perez Mar 17 '22 at 22:56
  • Are you able to do some logging from within your cronjob that indicates a failure? If that's the case, you can simply create an alert policy based on a log query. – Mert Z. Jul 15 '22 at 11:21
  • Also GKE exports kubernetes events of pods to Cloud Logging so again you can create an alert based on the log record which prints the status of a CronJob execution. – Mert Z. Jul 15 '22 at 11:48

I was searching for a solution for monitoring GKE CronJobs as well and found this method:

By utilizing GCP's Log Alert feature, we were able to use this following Log query to be notified when a Cronjob's job is considered failed Log Based Alert Doc

resource.labels.cluster_name="CLUSTER_NAME"
resource.type="k8s_cluster"
jsonPayload.source.component="cronjob-controller"
jsonPayload.reason="SawCompletedJob"
"status: Failed"
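If you prefer to create that log-based alert policy from the command line rather than the Console, a sketch of one way to do it follows. The display names, rate limit, and notification channel are placeholders I've made up; `conditionMatchedLog` is the Cloud Monitoring condition type that log-based alerts use:

```shell
# Sketch: write out a log-based alert policy matching the query above.
# YOUR_PROJECT and CHANNEL_ID are placeholders you must replace.
cat > cronjob-failed-policy.json <<'EOF'
{
  "displayName": "CronJob failed",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "cronjob-controller saw a failed Job",
      "conditionMatchedLog": {
        "filter": "resource.type=\"k8s_cluster\" AND jsonPayload.source.component=\"cronjob-controller\" AND jsonPayload.reason=\"SawCompletedJob\" AND \"status: Failed\""
      }
    }
  ],
  "alertStrategy": {
    "notificationRateLimit": { "period": "300s" }
  },
  "notificationChannels": [
    "projects/YOUR_PROJECT/notificationChannels/CHANNEL_ID"
  ]
}
EOF
```

You would then create the policy with `gcloud alpha monitoring policies create --policy-from-file=cronjob-failed-policy.json` (this needs the Monitoring API enabled and appropriate IAM permissions).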

A sample log entry looks like this:

{
  "insertId": "sb6oijf4yi39m",
  "jsonPayload": {
    "type": "Normal",
    "reportingComponent": "",
    "source": {
      "component": "cronjob-controller"
    },
    "lastTimestamp": "2023-04-19T12:05:53Z",
    "metadata": {
      "uid": "0efad02d-c441-4964-b048-496552ecc572",
      "namespace": "default",
      "managedFields": [
        {
          "apiVersion": "v1",
          "time": "2023-04-19T12:05:53Z",
          "manager": "kube-controller-manager",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "fieldsV1": {
            "f:count": {},
            "f:reason": {},
            "f:firstTimestamp": {},
            "f:type": {},
            "f:source": {
              "f:component": {}
            },
            "f:involvedObject": {},
            "f:lastTimestamp": {},
            "f:message": {}
          }
        }
      ],
      "resourceVersion": "47727088",
      "creationTimestamp": "2023-04-19T12:05:53Z",
      "name": "CRONJOB_NAME.1757548d9eb51a26"
    },
    "message": "Saw completed job: CRONJOB_NAME-28031760, status: Failed",
    "kind": "Event",
    "eventTime": null,
    "involvedObject": {
      "apiVersion": "batch/v1",
      "namespace": "default",
      "kind": "CronJob",
      "name": "CRONJOB_NAME",
      "uid": "6c43108b-14d6-11ea-ac1e-42010af00026",
      "resourceVersion": "1286547540"
    },
    "reportingInstance": "",
    "apiVersion": "v1",
    "reason": "SawCompletedJob"
  },
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "cluster_name": "REDACTED",
      "project_id": "REDACTED",
      "location": "asia-east1-a"
    }
  },
  "timestamp": "2023-04-19T12:05:53Z",
  "severity": "INFO",
  "logName": "projects/REDACTED/logs/events",
  "receiveTimestamp": "2023-04-19T12:05:58.075764494Z"
}
Sean Yuan