Using Terraform, I spin up the following resources for my primary cluster, using a unique service account for this cluster:

resource "google_container_cluster" "primary" {
  name     = var.gke_cluster_name
  location = var.region
  
  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {}
  networking_mode = "VPC_NATIVE"

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  cluster_autoscaling {
    enabled = true
    
    resource_limits {
      resource_type = "cpu"
      maximum = 40
      minimum = 3
    }

    resource_limits {
      resource_type = "memory"
      maximum = 100
      minimum = 12
    }
  }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "${google_container_cluster.primary.name}-node-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  node_count = var.gke_num_nodes

  node_config {

    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    labels = {
      env = var.project_id
    }
    disk_size_gb = 150
    preemptible  = true
    machine_type = var.machine_type
    tags         = ["gke-node", "${var.project_id}-gke"]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

Even though I give the nodes the appropriate permissions to pull from Google Container Registry (roles/containerregistry.ServiceAgent), I sometimes randomly get an ImagePullError from Kubernetes:

Unexpected status code [manifests latest]: 401 Unauthorized

After using the following command to inspect the service accounts assigned to the node pools:

gcloud container clusters describe master-env --zone="europe-west2" | grep "serviceAccount"

I see the following output:

serviceAccount: default
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com
serviceAccount: master-env@<project-id>.iam.gserviceaccount.com

This indicates that although I've specified the correct service account to assign to the nodes, for some reason (I think for the primary pool) the default service account is assigned instead, which uses the wrong OAuth scopes:

oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring

Instead of https://www.googleapis.com/auth/cloud-platform.
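
For reference, the service account and scopes actually configured on each node pool can be checked with something like the following (the pool name here just assumes the "${cluster}-node-pool" naming from the Terraform above):

gcloud container node-pools describe master-env-node-pool --cluster="master-env" --zone="europe-west2" --format="yaml(config.serviceAccount, config.oauthScopes)"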

How can I make sure the same service account is used for all nodes?

Edit 1:

After implementing the fix from @GariSingh, all my node pools now use the same service account as expected. However, I still sometimes get the unexpected status code [manifests latest]: 401 Unauthorized error when installing my services onto the cluster.

This is unusual, as other services installed onto the cluster seem to pull their images from GCR without issue.

Describing the pod events shows the following:

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/<my-deployment> to gke-master-env-nap-e2-standard-2-<id>
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "private-key" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  11m                  kubelet            MountVolume.SetUp failed for volume "kube-api-access-5hh9r" : failed to sync configmap cache: timed out waiting for the condition
  Warning  Failed       9m34s (x5 over 10m)  kubelet            Error: ImagePullBackOff
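
The node in the events above is a NAP-provisioned one (note the -nap- in its name), so one way to see which service account and scopes that VM actually got is to describe the underlying instance; something along these lines, with the node name taken from the events and the zone being whichever zone the node was created in (europe-west2-a is just a placeholder):

gcloud compute instances describe gke-master-env-nap-e2-standard-2-<id> --zone="europe-west2-a" --format="yaml(serviceAccounts)"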

Edit 2:

The final piece of the puzzle was to add oauth_scopes to the auto_provisioning_defaults block, similar to the node_config blocks, so that the service account could actually be used.
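
For anyone landing here later, the relevant part of the cluster_autoscaling block ended up looking roughly like this (resource_limits omitted, unchanged from above):

cluster_autoscaling {
  enabled = true

  auto_provisioning_defaults {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  # resource_limits blocks for cpu and memory stay as before
}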

1 Answer

Not sure if you intended to use Node auto-provisioning (NAP), which I highly recommend unless it doesn't meet your needs, but the cluster_autoscaling block for google_container_cluster actually enables NAP. It does not enable the cluster autoscaler for individual node pools.

If your goal is to enable cluster autoscaling for the node pool you created in your config and not use NAP, then you'll need to delete the cluster_autoscaling block, add an autoscaling block to your google_container_node_pool resource (at the resource level, not inside node_config), and change node_count to initial_node_count:

resource "google_container_node_pool" "primary_nodes" {
  name       = "${google_container_cluster.primary.name}-node-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  initial_node_count = var.gke_num_nodes

  autoscaling {
    min_node_count = var.min_nodes
    max_node_count = var.max_nodes
  }

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    labels = {
      env = var.project_id
    }
    disk_size_gb = 150
    preemptible  = true
    machine_type = var.machine_type
    tags         = ["gke-node", "${var.project_id}-gke"]
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

(the above assumes you have set variables for the minimum and maximum node counts)
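
For completeness, those variables might be declared along these lines (the defaults are just placeholders):

variable "min_nodes" {
  description = "Minimum number of nodes (per zone) for the node pool autoscaler"
  type        = number
  default     = 1
}

variable "max_nodes" {
  description = "Maximum number of nodes (per zone) for the node pool autoscaler"
  type        = number
  default     = 5
}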

If you do want to use NAP, then you'll need to add an auto_provisioning_defaults block and configure the service_account property:

resource "google_container_cluster" "primary" {
  name     = var.gke_cluster_name
  location = var.region
  
  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {}
  networking_mode = "VPC_NATIVE"

  node_config {
    service_account = google_service_account.cluster_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  cluster_autoscaling {
    enabled = true
    
    auto_provisioning_defaults {
      service_account = google_service_account.cluster_sa.email
    }      

    resource_limits {
      resource_type = "cpu"
      maximum = 40
      minimum = 3
    }

    resource_limits {
      resource_type = "memory"
      maximum = 100
      minimum = 12
    }
  }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
}
Gari Singh