GPU nodegroup in EKS

Question

I am not able to create a GPU nodegroup in EKS with eksctl. I get this error from CloudFormation:

[!] retryable error (Throttling: Rate exceeded status code: 400, request id: 1e091568-812c-45a5-860b-d0d028513d28) from cloudformation/DescribeStacks - will retry after delay of 988.442104ms

This is my clusterconfig.yaml:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
    name: CLUSTER_NAME
    region: AWS_REGION
nodeGroups:
    - name: NODE_GROUP_NAME_GPU
      ami: auto 
      minSize: MIN_SIZE
      maxSize: MAX_SIZE
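      instancesDistribution:
        # Instance types the Spot request may choose from
        instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]
        # 0 base capacity and 0% On-Demand above base: all nodes are Spot
        onDemandBaseCapacity: 0
        onDemandPercentageAboveBaseCapacity: 0
        spotInstancePools: 1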
      privateNetworking: true
      securityGroups:
        withShared: true
        withLocal: true
        attachIDs: [SECURITY_GROUPS]
      iam:
        instanceProfileARN: IAM_PROFILE_ARN
        instanceRoleARN: IAM_ROLE_ARN
      ssh:
        allow: true
        publicKeyPath: '----'
      tags:
        k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
        k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
        k8s.io/cluster-autoscaler/enabled: 'true'
      labels:
        lifecycle: Ec2Spot
        nvidia.com/gpu: 'true'
        k8s.amazonaws.com/accelerator: nvidia-tesla
      taints:
        nvidia.com/gpu: "true:NoSchedule"

Can you copy the error message from the CloudFormation console (Stack details) into the question? — gohm'c, Apr 07 '22 at 08:20
The error is in my question: [!] retryable error (Throttling: Rate exceeded status code: 400, request id: 1e091568-812c-45a5-860b-d0d028513d28) from cloudformation/DescribeStacks - will retry after delay of 988.442104ms — Jumana Kass, Apr 07 '22 at 08:52
Go to the CloudFormation console and look for the stack `eksctl-CLUSTER_NAME...`, open the Events tab, and look for the failure in the Status column. Copy the error (Status reason) into your question. — gohm'c, Apr 07 '22 at 09:03
In CloudFormation I can see that the node was created successfully, but it was not added to the cluster. Might the cluster be missing some GPU-related driver? Do you have any idea? — Jumana Kass, Apr 11 '22 at 07:17

score 0 · Answer 1 · answered Apr 27 '22 at 07:38

The resolution was to install the NVIDIA device plugin on the cluster so that the cluster recognizes the GPU nodes.
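
To confirm that the GPUs are visible after installing the plugin, one option is a small test pod that requests a GPU. This is only a sketch: it assumes the plugin in question is the NVIDIA device plugin for Kubernetes, which advertises the `nvidia.com/gpu` resource. The pod name `gpu-smoke-test` and the `CUDA_IMAGE` placeholder are hypothetical and not from the original setup. The toleration mirrors the `nvidia.com/gpu` taint from the clusterconfig.yaml in the question.

```yaml
# Hypothetical test pod; the name and CUDA_IMAGE are placeholders, not from the original post.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: CUDA_IMAGE            # placeholder: any CUDA base image
      command: ["nvidia-smi"]      # lists the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1        # resource advertised by the NVIDIA device plugin
  tolerations:
    - key: nvidia.com/gpu          # matches the taint set on the GPU nodegroup
      operator: Equal
      value: "true"
      effect: NoSchedule
```

If the pod stays Pending with an event about insufficient `nvidia.com/gpu`, the plugin is probably not running on the GPU node yet.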

Jumana Kass
