First of all, I want to emphasize that this is not a duplicate of (e.g.) this one.

Problem description:

I am running Kubeflow Pipelines (set up on GCP AI Platform Pipelines) on a GKE cluster. Each pipeline consists of several components (i.e. Docker containers, i.e. Pods). If I enforce that only one Pod can run per node, everything works as expected and files can be uploaded from that node to the target GCS bucket. From that I conclude that there shouldn't be a permission problem in the first place, right?
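
For context, the one-Pod-per-node constraint can be expressed roughly like this (a simplified sketch assuming kfp v1's ContainerOp helpers and the standard Kubernetes pod anti-affinity API; the label 'kfp-worker' is only illustrative):

from kubernetes.client import V1Affinity, V1LabelSelector, V1PodAffinityTerm, V1PodAntiAffinity

# never schedule two Pods carrying the "kfp-worker" label onto the same node
anti_affinity = V1Affinity(
    pod_anti_affinity=V1PodAntiAffinity(
        required_during_scheduling_ignored_during_execution=[
            V1PodAffinityTerm(
                label_selector=V1LabelSelector(match_labels={'kfp-worker': 'true'}),
                topology_key='kubernetes.io/hostname')]))

# inside the pipeline definition (names are placeholders):
# op = my_component(...)
# op.add_pod_label('kfp-worker', 'true')
# op.add_affinity(anti_affinity)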

However, when multiple Pods (>1) run on the same node in the pool in order to parallelize pipeline execution and make optimal use of resources, an error occurs:

google.api_core.exceptions.Forbidden: 403 POST https://storage.googleapis.com/upload/storage/v1/b/my_bucket/o?uploadType=resumable: {
  "error": {
    "code": 403,
    "message": "Insufficient Permission",
    "errors": [
      {
        "message": "Insufficient Permission",
        "domain": "global",
        "reason": "insufficientPermissions"
      }
    ]
  }
}
: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.CREATED: 201>)

It's also worth mentioning that previously failed uploads to GCS will usually succeed when I simply clone the failed pipeline run and restart it. This is probably because there is no other (conflicting) Pod on the same node in the new run.

I am uploading files from a VM (cluster node) to a Google Cloud Storage bucket like this:

from google.cloud import storage

src_file = 'my_sourcefile_in_this_docker_container'
bucket_name = 'my_bucket'
gcs_target_path = 'my_gcs_path'

# each component / container creates its own client
GCS_CLIENT = storage.Client()
gcs_bucket = GCS_CLIENT.bucket(bucket_name)
gcs_bucket.blob(gcs_target_path).upload_from_filename(src_file, timeout=300)

The error does not always occur in the same pipeline or component, but rather somewhat randomly. It seems to me that there might be some conflict between the containers, or between the storage.Client() connections they create when trying to upload files, but I might be wrong or missing something here.
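
For what it's worth, one can check from inside a running component which service account and OAuth scopes the Pod actually resolves, via the standard GCE/GKE metadata endpoints (a sketch; for uploads to work, the scopes should include devstorage.full_control or cloud-platform):

import requests

METADATA = 'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default'
HEADERS = {'Metadata-Flavor': 'Google'}

# which service account the Pod uses and which OAuth scopes its tokens carry
print(requests.get(METADATA + '/email', headers=HEADERS).text)
print(requests.get(METADATA + '/scopes', headers=HEADERS).text)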

What I've tried so far to tackle the issue (unfortunately without any success):

  • I decorated my upload function with a retry strategy that calls the upload several times, increasing the backoff time exponentially up to 2 minutes, for a maximum of 20 attempts (see the sketch after this list)
  • before uploading a file, I delete the target file in the storage bucket in case it already exists (also shown in the sketch below)
  • I created the worker node pool on which pipeline execution takes place with full storage access scopes:
gcloud container node-pools create kfp-worker-pool \
--cluster=$KFP_CLUSTER_NAME \
--zone=$COMPUTE_ZONE \
--machine-type=$MACHINE_TYPE \
--scopes=cloud-platform,storage-full \
--num-nodes=$NUM_NODES
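
The retry and pre-delete logic from the first two points looks roughly like this (a simplified sketch; the function name and parameters are illustrative, the real code lives inside the pipeline component):

import time

from google.api_core.exceptions import Forbidden
from google.cloud import storage


def upload_with_retry(src_file, bucket_name, gcs_target_path,
                      max_tries=20, max_backoff=120, timeout=300):
    blob = storage.Client().bucket(bucket_name).blob(gcs_target_path)
    backoff = 1
    for attempt in range(1, max_tries + 1):
        try:
            if blob.exists():        # delete the target object if it already exists
                blob.delete()
            blob.upload_from_filename(src_file, timeout=timeout)
            return
        except Forbidden:
            if attempt == max_tries:
                raise
            time.sleep(backoff)      # exponential backoff, capped at 2 minutes
            backoff = min(backoff * 2, max_backoff)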

What I haven't tried yet, since I don't think it's very promising and I am running out of time:

  • I haven't recreated the cluster as suggested here, since I re-create the node pool for pipeline execution (not the default pool, which is separate) every time before I start the pipeline runs (also, there are no permission issues when I run pipelines in the one-Pod-per-node-at-a-time fashion)

I'd greatly appreciate any solutions or even ideas on how to investigate the issue further. Thanks for your help.

Peabuddy
  • Do you use workload identity? – guillaume blaquiere Apr 12 '21 at 18:26
  • No, I haven't used workload identity so far. Would that be helpful and why? – Peabuddy Apr 13 '21 at 07:52
  • No, it's the opposite. I thought that workload identity could have issues! – guillaume blaquiere Apr 13 '21 at 08:51
  • Hello, workload identity in short is used to assign specific GCP permissions (like storage access) to the Kubernetes service accounts. It could solve your permission issues. If it's feasible, please try it and let me know if it solved your issue: [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). Also, have you tried to reproduce this error "manually" (create `Pods` on the same `Node` that are uploading objects)? If it's feasible, please provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Dawid Kruk Apr 13 '21 at 09:52
  • @DawidKruk - Thanks for your advice. I'll have a look into it. I will check if workload identity solves my problem and, if not, whether the same error gets raised when I manually allocate two Pods (with uploads) on the same cluster node. Anyway: can you elaborate on why workload identity should make a difference here, considering that uploads succeed without that additional setting in a single-Pod-per-node scenario? – Peabuddy Apr 13 '21 at 17:29
  • @Peabuddy I've tried to reproduce your use case and found no issues while using authentication scopes. By trying workload identity and by manual testing you'll be able to exclude parts of your solution and limit the area where you should look for the cause of your issue. During my reproduction I've used this [code](https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python) and [image](https://hub.docker.com/r/google/cloud-sdk/) on a single `GKE` node, accessing the object simultaneously. – Dawid Kruk Apr 14 '21 at 15:24
  • @DawidKruk Thanks for your efforts. I tried to use workload identity now. Everything has been set up as instructed by the GCP manual you referred to in your previous comment. However, I am receiving another error when trying to run upload_from_filename() or any other call of the storage API's GET method: google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/projects/my-project/serviceAccount?prettyPrint=false : Caller does not have resourcemanager.projects.get access to the Google Cloud project. – Peabuddy Apr 16 '21 at 13:42
  • @DawidKruk When you tried to reproduce my issue, you probably did not use a Kubeflow pipeline component, but an ordinary Docker container, right? Maybe there is an issue with Kubeflow? Sorry, I am just guessing, but I am rather clueless at the moment. – Peabuddy Apr 16 '21 at 13:51

0 Answers