GCP AI Platform - Pipelines - Clusters - Does not have minimum availability

Question

I can't create pipelines. I can't even load the samples / tutorials on the AI Platform Pipelines Dashboard because it doesn't seem to be able to proxy to whatever it needs to.

An error occurred
Error occured while trying to proxy to: ...

I looked into the cluster's details and found 3 components with errors:

Deployment  metadata-grpc-deployment     Does not have minimum availability 
Deployment  ml-pipeline  Does not have minimum availability 
Deployment  ml-pipeline-persistenceagent     Does not have minimum availability

Creating the clusters involve approx. 3 clicks in GCP Kubernetes Engine so I don't think I messed up this step.

Anyone have an idea of how to achieve "minimum availability"?

UPDATE 1

Nodes have adequate resources and are Ready. YAML file looks good. I have 2 clusters in diff regions/zones and both have the deployment errors listed above. 2 Pods are not ok.

Name:         ml-pipeline-65479485c8-mcj9x
Namespace:    default
Priority:     0
Node:         gke-cluster-3-default-pool-007784cb-qcsn/10.150.0.2
Start Time:   Thu, 17 Sep 2020 22:15:19 +0000
Labels:       app=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines-3
              pod-template-hash=65479485c8
Annotations:  kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ml-pipeline-api-server

Status:       Running
IP:           10.4.0.8
IPs:
IP:           10.4.0.8
Controlled By:  ReplicaSet/ml-pipeline-65479485c8
Containers:
  ml-pipeline-api-server:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Ports:          8888/TCP, 8887/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Fri, 18 Sep 2020 10:27:31 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 18 Sep 2020 10:20:38 +0000
      Finished:     Fri, 18 Sep 2020 10:27:31 +0000
    Ready:          False
    Restart Count:  98
    Requests:
      cpu:      100m
    Liveness:   exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      HAS_DEFAULT_BUCKET:                   true
      BUCKET_NAME:
      PROJECT_ID:                           <set to the key 'project_id' of config map 'gcp-default-config'>  Optional: false
      POD_NAMESPACE:                        default (v1:metadata.namespace)
      DEFAULTPIPELINERUNNERSERVICEACCOUNT:  pipeline-runner
      OBJECTSTORECONFIG_SECURE:             false
      OBJECTSTORECONFIG_BUCKETNAME:
      DBCONFIG_DBNAME:                      kubeflow_pipelines_3_pipeline
      DBCONFIG_USER:                        <set to the key 'username' in secret 'mysql-credential'>  Optional: false
      DBCONFIG_PASSWORD:                    <set to the key 'password' in secret 'mysql-credential'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-77xl8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ml-pipeline-token-77xl8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ml-pipeline-token-77xl8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From                                               Message
  ----     ------     ----                  ----                                               -------
  Warning  BackOff    52m (x409 over 11h)   kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Back-off restarting failed container
  Warning  Unhealthy  31m (x94 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Readiness probe failed:
  Warning  Unhealthy  31m (x29 over 10h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  (combined from similar events): Readiness probe failed: c
annot exec in a stopped state: unknown
  Warning  Unhealthy  17m (x95 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Liveness probe failed:
  Normal   Pulled     7m26s (x97 over 12h)  kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Container image "gcr.io/cloud-marketplace/google-cloud-ai
-platform/kubeflow-pipelines/apiserver:1.0.0" already present on machine
  Warning  Unhealthy  75s (x78 over 12h)    kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Liveness probe errored: rpc error: code = DeadlineExceede
d desc = context deadline exceeded

And the other pod:

Name:         ml-pipeline-persistenceagent-67db8b8964-mlbmv
Events:
  Type     Reason   Age                   From                                               Message
  ----     ------   ----                  ----                                               -------
  Warning  BackOff  32s (x2238 over 12h)  kubelet, gke-cluster-3-default-pool-007784cb-qcsn  Back-off restarting failed container

SOLUTION

Do not let google handle any storage. Uncheck "Use managed storage" and set up your own artifact collections manually. You don't actually need to enter anything in these fields since the pipeline will be launched anyway.

NOTICE: Kubeflow dev said to uncheck "Use managed storage" on GCP when setting up the cluster. This solved the pods issues. We don't know why. — Sterls, Sep 22 '20 at 10:58

score 1 · Accepted Answer · answered Sep 18 '20 at 08:11

1

The Does not have minimum availability error is generic. There could be many issues that trigger it. You need to analyse more in-depth in order to find the actual problem. Here are some possible causes:

Insufficient resources: check if your Node has adequate resources (CPU/Memory). If Node is ok than check the Pod's status.
Liveliness probe and/or Readiness probe failure: execute kubectl describe pod <pod-name> to check if they failed and why.
Deployment misconfiguration: review your deployment yaml file to see if there are any errors or leftovers from previous configurations.
You can also try to wait a bit as sometimes it takes some time in order to deploy everything and/or try changing your Region/Zone.

answered Sep 18 '20 at 08:11

Wytrzymały Wiktor

11,492
5
29
37

I added some more info. Is there anything that stands out as problematic for you? – Sterls Sep 18 '20 at 11:03
Looks like Liveness and Readiness probes are failing. You need to check if they are correctly configured. Use [this](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) as a general guide. – Wytrzymały Wiktor Sep 18 '20 at 11:09
Everything I've done with these clusters and pipelines are through GCP GUI. GCP makes it seem as though I can point and click and magically everything should be up and running. This is why I don't know if I am supposed to actually configure those files at all. From my search, the yaml files look the way they should. – Sterls Sep 21 '20 at 16:43
Still, there is a problem with your Liveliness and Readiness probes. In order to fix it you need to verify if they are correctly configured. I think it would be best if you ask a separate question explaining this particular problem just to not mix up multiple topics here as in this question you asked about the possible causes of the `Does not have minimum availability` error and we got that covered. – Wytrzymały Wiktor Sep 22 '20 at 09:06
Agreed. The Kubeflow dev said not to check the box "Use Managed Storage" when setting up the cluster. This solved the the pods issues. I think there's something wrong with how GCP tries to automate handling the persistence. (Anyway, on to the other issue that popped up lol) – Sterls Sep 22 '20 at 10:58

GCP AI Platform - Pipelines - Clusters - Does not have minimum availability

1 Answers1