
Issue: I am writing a query that checks every Kubernetes node to make sure coredns is running. If it is not running, and it has been more than 30 minutes since it last was, I want to send an alert.

The alert part is secondary to my initial question and doesn't have to be addressed in this thread. I just want to figure out how to get this information in the first place.

Essentially: Hey node, do you have a pod named coredns.* running? If no, has it been more than 30m since you did?



My strategy: I assume searching for nodes that do not have a pod name of coredns.* is how I would start.

FROM K8sPodSample SELECT nodeName WHERE podName != 'coredns.*'

Then, set the time frame to start 31 minutes ago. (I'm not sure whether this returns only nodes that have gone the full 31 minutes without the pod, or every node that was missing the pod at any point in the last 31 minutes, even if only for a few minutes.)

SINCE 31 minutes ago

The query will run at the cluster level, so I will add that filter as well.

WHERE clusterName = '<clusterName>' 

Then, if this worked properly, I'll generate an alert for any nodes that show up in this list.


Am I thinking about this properly, or could this be accomplished in a better way?



Update: My new strategy is to return a nodeName where the count of pods with coredns in their name is 0...still working this part out.

NayefMusa

1 Answer


The trick here is to look for pods that have coredns in the name and whose status is not Running, faceting (grouping) by nodeName and namespace.

SELECT uniqueCount(podName) FROM K8sPodSample WHERE namespace NOT LIKE '%kube-system%' AND namespace NOT LIKE '%<ourNS>%' AND podName LIKE '%coredns%' AND status != 'Running' FACET nodeName, namespace

The only issue I can see with this is the case where no pod with that name exists at all. The query doesn't account for that scenario: it assumes that a pod that is not in working order still reports some sort of status.

Given the nature of Kubernetes, I think that's a fair assumption, since it will always try to restart a pod that is part of a ReplicaSet, DaemonSet, etc. The pod will therefore always have a status.
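
If the missing-pod case does need to be covered, one possible approach is to count the Running coredns pods per node and look for nodes where that count is 0. The following is an untested sketch rather than something taken from our setup: it assumes NRQL's filter() aggregator, reuses the attribute names from the query above, and uses a made-up alias (runningCoredns); <clusterName> is a placeholder.

```nrql
SELECT filter(uniqueCount(podName), WHERE podName LIKE '%coredns%' AND status = 'Running') AS 'runningCoredns' FROM K8sPodSample WHERE clusterName = '<clusterName>' FACET nodeName SINCE 30 minutes ago
```

Because the FACET applies to every pod sample and not only to coredns pods, a node that reported any pod during the window should still get a row, with runningCoredns at 0 if no Running coredns pod was seen in those 30 minutes. A node that reported no pod samples at all would not show up in the result.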

Note: kube-system and 'ourNS' are excluded in this query only because they are not needed in our particular scenario. We are only looking at customer namespaces.

NayefMusa