I hit the same problem again in production, but after restarting all the pods several times, they all ended up back on the correct worker nodes.
Then I noticed something odd: every time I found pods on the wrong worker nodes, those pods and nodes had been created very close together in time.
So my guess is that if pods and worker nodes start at the same time, a pod may be scheduled before EKS has applied the taint to the new node, and so it can land on a node whose taint it does not tolerate.
I tried a few things to work around this, and the following worked in my environment:
- Set a nodeSelector or nodeAffinity on the pod, so the scheduler checks that the node has a matching label before placing the pod on it
- Change the effect to NoExecute in the taint and toleration (once the taint is applied, a pod that does not tolerate it is evicted and rescheduled onto another worker node)
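
For reference, here is a minimal sketch of a pod spec that combines both workarounds. The label and taint key `dedicated`, the value `batch`, and the pod and image names are placeholders I made up for illustration, not values from my actual cluster. It assumes the target nodes carry the label `dedicated=batch` and the taint `dedicated=batch:NoExecute`.

```yaml
# Hypothetical example: "dedicated" / "batch" are placeholder names.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Only nodes labeled dedicated=batch are candidates for this pod.
  nodeSelector:
    dedicated: batch
  # Lets the pod stay on nodes tainted dedicated=batch:NoExecute.
  # Pods without this toleration are evicted from those nodes
  # once the taint is applied.
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoExecute
  containers:
    - name: app
      image: nginx
```

The nodeSelector keeps this pod off other node groups, while the NoExecute taint removes pods that slipped onto the tainted nodes before the taint was in place.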
Hope this information helps you resolve your issue.