
We have 2 nodes, each with 96 GB of RAM. The plan was that our pods would take 90.5 GB of RAM from one of the nodes and 91 GB from the other. What actually happened was that the pods took 93.5 GB from one of the nodes and 88 GB from the other. This caused the pods to just restart forever, and the application never reached a running state.

Background: We are new to Kubernetes and are using version 1.14 on an EKS cluster on AWS (v1.14.9-eks-658790). Currently we have pods of different sizes that together make up 1 unit of our product. On the testing setup we want to work with 1 unit, and on production with many. Paying more money for nodes, reducing the pod limits, or reducing the number of copies are all problematic for us.

Details on the pods (all memory values are in GB):

+-------------+--------------+-----------+-------------+
|  Pod name   | Mem requests | pod limit | # of copies |
+-------------+--------------+-----------+-------------+
| BIG-OK-POD  | 35           | 46        | 2           |
| OK-POD      | 7.5          | 7.5       | 4           |
| A-OK-POD    | 6            | 6         | 8           |
| WOLF-POD    | 5            | 5         | 1           |
| WOLF-B-POD  | 1            | 1         | 1           |
| SHEEP-POD   | 2            | 2         | 1           |
| SHEEP-B-POD | 2            | 2         | 1           |
| SHEEP-C-POD | 1.5          | 1.5       | 1           |
+-------------+--------------+-----------+-------------+

We don't care where the pods run; we just want each node to be able to handle the memory requirements without failing.

I renamed the pods to make it easier to follow what we expected.

Expected placement:

We expected the wolf pods to be on one node and the sheep pods on the other, while the OK pods would be split between the nodes.

Node 1:
+-------------+-----------+-------------+----------------+
|  Pod name   | pod limit | # of copies | combined limit |
+-------------+-----------+-------------+----------------+
| BIG-OK-POD  | 46        | 1           |             46 |
| OK-POD      | 7.5       | 2           |             15 |
| A-OK-POD    | 6         | 4           |             24 |
| WOLF-POD    | 5         | 1           |              5 |
| WOLF-B-POD  | 1         | 1           |              1 |
+-------------+-----------+-------------+----------------+
|                                       | TOTAL: 91      |
+-------------+-----------+-------------+----------------+

Node 2:

+-------------+-----------+-------------+----------------+
|  Pod name   | pod limit | # of copies | combined limit |
+-------------+-----------+-------------+----------------+
| BIG-OK-POD  | 46        | 1           | 46             |
| OK-POD      | 7.5       | 2           | 15             |
| A-OK-POD    | 6         | 4           | 24             |
| SHEEP-POD   | 2         | 1           | 2              |
| SHEEP-B-POD | 2         | 1           | 2              |
| SHEEP-C-POD | 1.5       | 1           | 1.5            |
+-------------+-----------+-------------+----------------+
|                                       | TOTAL: 90.5    |
+-------------+-----------+-------------+----------------+

Actual placement:

Node 1:
+-------------+-----------+-------------+----------------+
|  Pod name   | pod limit | # of copies | combined limit |
+-------------+-----------+-------------+----------------+
| BIG-OK-POD  | 46        | 1           | 46             |
| OK-POD      | 7.5       | 2           | 15             |
| A-OK-POD    | 6         | 4           | 24             |
| WOLF-POD    | 5         | 1           | 5              |
| SHEEP-B-POD | 2         | 1           | 2              |
| SHEEP-C-POD | 1.5       | 1           | 1.5            |
+-------------+-----------+-------------+----------------+
|                                       | TOTAL: 93.5    |
+-------------+-----------+-------------+----------------+

Node 2:
+-------------+-----------+-------------+----------------+
|  Pod name   | pod limit | # of copies | combined limit |
+-------------+-----------+-------------+----------------+
| BIG-OK-POD  | 46        | 1           | 46             |
| OK-POD      | 7.5       | 2           | 15             |
| A-OK-POD    | 6         | 4           | 24             |
| WOLF-B-POD  | 1         | 1           | 1              |
| SHEEP-POD   | 2         | 1           | 2              |
+-------------+-----------+-------------+----------------+
|                                       | TOTAL: 88      |
+-------------+-----------+-------------+----------------+

Is there a way to tell Kubernetes that each node should leave 4 GB of memory for the node itself?

After reading Marc ABOUCHACRA's answer, we tried changing system-reserved=memory (which was set to 0.2Gi), but for any value higher than 0.3Gi (0.5Gi, 1Gi, 2Gi, 3Gi and 4Gi), the pods were stuck in the Pending state forever.

Update: We found a way to reduce the limit on a few of the pods, and now the system is up and stable (even though one of the nodes is at 99%). We couldn't get K8s to start with the previous config, and we still don't know why.

mgershen
  • You only mention pod limit, have you defined resource requests? Also, can you share the pod YAML definition? – wolmi Sep 10 '20 at 08:01
  • Have you already considered NodeAffinity/PodAffinity? https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ – Nick Sep 10 '20 at 12:43

3 Answers

2

Kubernetes lets you configure the node in order to reserve resources for system daemons.

To do that, you need to configure the kubelet agent. This is a per-node configuration.
So if you want to reserve 4 GB of memory on one node, you need to start the kubelet agent on that node with the following option:

--system-reserved=memory=4Gi

You can find out more about that in the official documentation.
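
The same reservation can also be expressed in a kubelet configuration file. Below is a sketch of a KubeletConfiguration fragment; the file location and the way it is wired into the kubelet depend on how the node was provisioned (on EKS it is usually handled by the node bootstrap), and the CPU value is only an illustrative placeholder:

# KubeletConfiguration fragment reserving resources for system daemons
# (file path and wiring are AMI/provisioning specific)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "4Gi"
  cpu: "500m"   # illustrative placeholder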

Marc ABOUCHACRA
  • This is what I was looking for. However we tested it today with a range of different values and for most of them, the application failed to start. Pods were stuck in the Pending state forever. The application did start when we set the system-reserved memory to 0.3Gi, but placement was not optimal. The values which caused the app to stay pending forever are 0.5Gi, 1Gi, 2Gi, 3Gi and 4Gi. – mgershen Sep 15 '20 at 10:48
  • Hmm that's weird. What was the cause of the pending state? Did you look at it with a `describe` on the pod? – Marc ABOUCHACRA Sep 15 '20 at 12:28
  • It said insufficient memory on all nodes. I feel like we are missing something in the configuration, but I don't know what. – mgershen Sep 15 '20 at 12:41
1

There are two resource specifiers for each resource type.

  1. Resource Request
  2. Resource Limit

The Resource Request specifies the amount of a specific resource (CPU or memory) that the pod should reserve. Pods are allowed to use more resources than what is requested - but not to exceed the Resource Limit.

As per the Kubernetes documentation:

When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.

Here is a typical configuration for resource request and limit.

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
Charlie
  • You have mentioned that you don't care which node the pod runs on, and you have only declared the resource limit - not the resource request. This is all you need to know if you want the pods to allocate precise amounts of memory in your config. – Charlie Sep 10 '20 at 11:28
  • But I already have these parameters, and they look like they should fit the nodes, yet I got the unexpected result. I am not sure what your suggestion is. – mgershen Sep 15 '20 at 10:50
  • If you set the resource request, Kubernetes will allocate that amount. It will not allocate more. My suggestion is to check all your pod configs to see if you have put the correct numbers. If Kubernetes allocates more than what you have requested, that means there is an error in one of your config files. – Charlie Sep 15 '20 at 10:55
  • You can read the values in the tables above. The problem is that it allocates too much on a given node. The expected vs. actual tables show the problem very clearly. – mgershen Sep 15 '20 at 11:00
  • Which pod is taking too much in your table? – Charlie Sep 15 '20 at 15:43
  • All pods are taking as they are configured to take. The problem is that the application fails to start. – mgershen Sep 15 '20 at 16:06
  • I don't get it. You said "the expected vs. actual tables show the problem very clearly". What does it show? – Charlie Sep 15 '20 at 16:22
  • It shows the pods are not divided the way we expected. – mgershen Sep 16 '20 at 06:52
  • If you want any pod to be bound to a particular node, you should label the node and use a label selector in the pod's config. – Charlie Sep 16 '20 at 07:23
  • The pods can go on whichever node has enough memory. The problem is that K8s placed them in a way that made the system fail to start. – mgershen Sep 22 '20 at 00:22
  • We have a system which has more than 50 different pods and many replicas of each, too, but Kubernetes handles them perfectly. – Charlie Sep 22 '20 at 03:19
1

You've touched on a few topics within the same "Stack Overflow question".

Topic 1.

Is there a way to tell Kubernetes that each node should leave 4 GB of memory for the node itself? Background: ... version 1.14 on an EKS cluster

The official doc on this topic says that it is configurable if your Kubernetes server is at or later than version 1.8.

There is an old thread on GitHub, "--kube-reserved and --system-reserved are not working #72762", which is worth checking as well.

There is also a very comprehensive article that explains how to prevent resource starvation of critical system and Kubernetes services.

Topic 2.

We expected the wolf pods to be on one node and the sheep pods on the other

You can constrain a Pod to only be able to run on particular Node(s), or to prefer to run on particular nodes. There are several ways to do this, and the recommended approaches all use label selectors to make the selection.

nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.
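
For example, a sketch with a hypothetical label (the key unit-group and the value wolves are made-up names): first label a node with kubectl label nodes <node-name> unit-group=wolves, then add a nodeSelector to the pod spec:

# Hypothetical sketch: schedule this pod only on nodes labelled unit-group=wolves
apiVersion: v1
kind: Pod
metadata:
  name: wolf-pod              # hypothetical name
spec:
  nodeSelector:
    unit-group: wolves
  containers:
  - name: wolf
    image: your-registry/wolf:latest   # placeholder image
    resources:
      requests:
        memory: "5Gi"
      limits:
        memory: "5Gi"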

while the OK pods would be split between the nodes.

Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form "this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y".
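
As a sketch (the app=big-ok label and the Deployment name are hypothetical), a required anti-affinity rule like the one below keeps the two BIG-OK-POD replicas on different nodes:

# Hypothetical sketch: spread the two BIG-OK-POD replicas across nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: big-ok
spec:
  replicas: 2
  selector:
    matchLabels:
      app: big-ok
  template:
    metadata:
      labels:
        app: big-ok
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: big-ok
            topologyKey: kubernetes.io/hostname
      containers:
      - name: big-ok
        image: your-registry/big-ok:latest   # placeholder image
        resources:
          requests:
            memory: "35Gi"
          limits:
            memory: "46Gi"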

Topic 3.

It sounds like it could work for the 2-node unit use case I shared in detail, but for production we have many nodes, and we'd rather not configure them one by one

  • that is the way it is supposed to work if you would like to custom-place your pods ("wolves" on odd nodes, "sheep" on even nodes, and only one instance of OK, A-OK, BIG-OK per node).
  • "we'd rather not configure them one by one" - there are plenty of ways to manage infrastructure/labels/deployments, but that is a separate question.
Nick
  • It sounds like it could work for the 2-node unit use case I shared in detail, but for production we have many nodes, and we'd rather not configure them one by one. – mgershen Sep 15 '20 at 10:54
  • Basically, you need to label the nodes and adjust the Deployment files. – Nick Sep 15 '20 at 12:09