I'm trying to get a GKE (Google Kubernetes Engine) cluster, provisioned by Terraform, running with a GPU node pool. If someone can point me at what I'm missing to get the GPU node pool working, that would be most awesome.
I am able to run work on the CPU node pool, but I haven't been able to get the drivers installed for the GPU node pool. There is good documentation on how to set this up, but when I follow it and run the daemonset the docs point me at, I get this error on the GPU nodes: "Can't access efivars filesystem at /sys/firmware/efi/efivars, aborting".
I am using the Ubuntu image on n1-standard-16 instances with T4 GPUs, and I can confirm that the nodes are running Kubernetes version 1.11.10-gke.5.
One thing that might be a clue: the node details page (navigate to the cluster, then to its nodes, then to one of the GPU nodes) lists the GPU count as 0, even though the node pool details page shows GPU accelerators at 1 per node. I am totally guessing here, but I think this might be because I haven't requested GPU resources correctly for this node pool, and I can't figure out how that would fit into the Terraform google_container_node_pool resource. I do have this in the google_container_node_pool for the GPU node pool, though:
resource "google_container_node_pool" "gpu_training_nodes" {
  ...
  node_config {
    ...
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
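
Putting the details above together, a minimal sketch of the full node pool resource might look like the following. This is an illustration, not the actual config: it assumes Terraform 0.12+ syntax, and the name, the cluster reference (google_container_cluster.primary), the location, and node_count are hypothetical placeholders. The image_type value is also an assumption about how the Ubuntu image gets selected. Only the machine type and the accelerator settings come from the description above.

```hcl
# Sketch only. Everything marked "placeholder" or "assumption" is hypothetical
# and does not come from the real configuration.
resource "google_container_node_pool" "gpu_training_nodes" {
  name       = "gpu-training-nodes"                  # placeholder
  cluster    = google_container_cluster.primary.name # placeholder cluster resource
  location   = "us-central1-a"                       # placeholder zone
  node_count = 1                                     # placeholder

  node_config {
    machine_type = "n1-standard-16" # from the description above
    image_type   = "UBUNTU"         # assumption: selects the Ubuntu node image

    # One T4 per node, as shown on the node pool details page.
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
```

Seeing the whole resource in one place makes it easier to check whether the image type and the accelerator block are actually ending up on the same node pool.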