I'm trying to get a GKE (Google Kubernetes Engine) cluster, provisioned by Terraform, running with a GPU node pool. If someone can point out what I'm missing to get the GPU node pool working, that would be much appreciated.

I am able to run workloads on the CPU node pool, but I haven't been able to get the drivers installed for the GPU node pool. There is good documentation on how to set this up, but when I follow it and run the DaemonSet the docs point me at, the GPU nodes report this error: "Can't access efivars filesystem at /sys/firmware/efi/efivars, aborting".

I am using the Ubuntu image on n1-standard-16 instances with T4 GPUs, and I can confirm that the nodes are running Kubernetes version 1.11.10-gke.5.

One thing that might be a clue: the node details page (cluster, then the nodes in the cluster, then one of the GPU nodes) lists the GPU count as 0, even though the node pool details page shows 1 GPU accelerator per node. I am guessing here, but this might be because I haven't requested GPU resources correctly for this node pool, and I can't figure out how that would fit into the Terraform google_container_node_pool resource. I do have this in the google_container_node_pool for the GPU node pool, though:

resource "google_container_node_pool" "gpu_training_nodes" {    
  ...
  node_config {
    ...
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
Adam

1 Answer

I was able to get the GPUs to show up and be usable by bringing all of the nodes in the cluster up to the same Kubernetes version. Previously, the master and CPU nodes were on 1.11.6-gke.11, behind the 1.11.10-gke.5 mentioned in the question. I have no idea why this helped, but it was the only change I made. It's possible that the update tore down and then re-provisioned certain resources, but it didn't have to bring any nodes down or do anything so dramatic, so I'm not sure how it made the difference.

I still get the efivars error but it doesn't seem to matter (yet).
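
For anyone wanting to keep the versions aligned from Terraform rather than by hand, here is a minimal sketch of one way to do it. It is not a confirmed fix for the root cause, only a way to express "master and node pools on the same version" in config. The cluster resource name "primary", the cluster and pool names, and the location are hypothetical placeholders; the version string, machine type, image, and accelerator settings are the ones from the question. It assumes a reasonably recent Google provider and Terraform 0.12+ expression syntax, and the image_type value may need adjusting for newer GKE releases.

```hcl
# Hypothetical cluster definition; name and location are placeholders.
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-a"
  min_master_version = "1.11.10-gke.5" # version from the question

  # Manage node pools as separate resources instead of the default pool.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "gpu_training_nodes" {
  name       = "gpu-training-nodes" # placeholder
  cluster    = google_container_cluster.primary.name
  location   = google_container_cluster.primary.location
  node_count = 1

  # Follow the master's version so the pool does not drift from it.
  version = google_container_cluster.primary.master_version

  management {
    # An explicit version can fight with auto-upgrade, so turn it off here.
    auto_upgrade = false
  }

  node_config {
    machine_type = "n1-standard-16"
    image_type   = "UBUNTU"

    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
```

The same version line can be added to the CPU node pool so that every pool follows the master.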

Adam