Using Terraform, I spin up the following resources for my primary cluster, using a unique service account for this cluster:
resource "google_container_cluster" "primary" {
name = var.gke_cluster_name
location = var.region
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
ip_allocation_policy {}
networking_mode = "VPC_NATIVE"
node_config {
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
cluster_autoscaling {
enabled = true
resource_limits {
resource_type = "cpu"
maximum = 40
minimum = 3
}
resource_limits {
resource_type = "memory"
maximum = 100
minimum = 12
}
}
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
}
resource "google_container_node_pool" "primary_nodes" {
name = "${google_container_cluster.primary.name}-node-pool"
location = var.region
cluster = google_container_cluster.primary.name
node_count = var.gke_num_nodes
node_config {
service_account = google_service_account.cluster_sa.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
labels = {
env = var.project_id
}
disk_size_gb = 150
preemptible = true
machine_type = var.machine_type
tags = ["gke-node", "${var.project_id}-gke"]
metadata = {
disable-legacy-endpoints = "true"
}
}
}
Even though I give the nodes the appropriate permissions to pull from Google Container Registry (roles/containerregistry.ServiceAgent), I sometimes randomly get an ImagePullError from Kubernetes:

Unexpected status code [manifests latest]: 401 Unauthorized
After using the following command to inspect the service accounts assigned to the node pools:
gcloud container clusters describe master-env --zone="europe-west2" | grep "serviceAccount"
I see the following output:
serviceAccount: default
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
This indicates that although I've specified the correct service account to assign to the nodes, for some reason (I think for the primary pool) the default service account is assigned instead, which uses the wrong OAuth scopes:

oauthScopes:
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring

instead of https://www.googleapis.com/auth/cloud-platform.
How can I make sure the same service account is used for all nodes?
Edit 1:
After implementing the fix from @GariSingh, all my node pools now use the same ServiceAccount as expected; however, I still sometimes get the unexpected status code [manifests latest]: 401 Unauthorized error when installing my services onto the cluster.
This is unusual, as other services installed onto the cluster seem to pull their images from gcr without issue.
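The fix itself isn't reproduced here, but it presumably comes down to adding an auto_provisioning_defaults block under cluster_autoscaling, since node auto-provisioning otherwise creates pools using the Compute Engine default service account; a minimal sketch, with values mirroring the config above:

resource "google_container_cluster" "primary" {
  # ... existing arguments as above ...

  cluster_autoscaling {
    enabled = true

    # Without this block, auto-provisioned node pools fall back to the
    # Compute Engine default service account.
    auto_provisioning_defaults {
      service_account = google_service_account.cluster_sa.email
    }

    # resource_limits blocks unchanged ...
  }
}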
Describing the pod events shows the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned default/<my-deployment> to gke-master-env-nap-e2-standard-2-<id>
Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "private-key" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 11m kubelet MountVolume.SetUp failed for volume "kube-api-access-5hh9r" : failed to sync configmap cache: timed out waiting for the condition
Warning Failed 9m34s (x5 over 10m) kubelet Error: ImagePullBackOff
Edit 2:
The final piece of the puzzle was to add oauth_scopes to the auto_provisioning_defaults block, similar to the node configs, so that the ServiceAccount could be used properly.
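A sketch of what the cluster_autoscaling block ends up looking like with that change, mirroring the node_config blocks above:

  cluster_autoscaling {
    enabled = true

    auto_provisioning_defaults {
      service_account = google_service_account.cluster_sa.email
      # Same scope as node_config, so auto-provisioned nodes can
      # authenticate when pulling images from gcr.
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
    }

    resource_limits {
      resource_type = "cpu"
      maximum       = 40
      minimum       = 3
    }
    resource_limits {
      resource_type = "memory"
      maximum       = 100
      minimum       = 12
    }
  }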