First of all, I want to emphasize that this is not a duplicate of (e.g.) this one.
Problem description:
I am running Kubeflow Pipelines (set up on GCP AI Platform Pipelines) on a GKE cluster. Each pipeline consists of several components (i.e. Docker containers / Pods). If I enforce that there can only be one running Pod per node, everything works as expected and files can be uploaded from that node to the target GCS bucket. From this I conclude that there shouldn't be a general permission problem in the first place, right?
However, when multiple Pods (>1) run on the same node in the pool in order to parallelize pipeline execution and make optimal use of resources, an error occurs:
google.api_core.exceptions.Forbidden: 403 POST
https://storage.googleapis.com/upload/storage/v1/b/my_bucket/o?uploadType=resumable
: {
  "error": {
    "code": 403,
    "message": "Insufficient Permission",
    "errors": [
      {
        "message": "Insufficient Permission",
        "domain": "global",
        "reason": "insufficientPermissions"
      }
    ]
  }
}
: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.CREATED: 201>)
It's also worth mentioning that previously failed uploads to GCS will succeed most of the time when I simply clone the failed pipeline run and restart it. This is probably because there is no other (conflicting) Pod on the same node in the new run.
I am uploading files from a VM (cluster node) to a Google Cloud Storage bucket like this:
from google.cloud import storage

src_file = 'my_sourcefile_in_this_docker_container'
bucket_name = 'my_bucket_name'
gcs_target_path = 'my_gcs_path'

# Client picks up the default credentials available inside the container
GCS_CLIENT = storage.Client()
gcs_bucket = GCS_CLIENT.bucket(bucket_name)
gcs_bucket.blob(gcs_target_path).upload_from_filename(src_file, timeout=300)
The error does not always occur in the same pipeline or component, but rather seemingly at random. It seems to me that there might be some conflict between the containers, or between the storage.Client() connections they create when trying to upload files, but I might be wrong or missing something here.
What I've tried so far to tackle the issue (unfortunately without any success):
- I decorated my upload function with a retry strategy that calls the upload repeatedly with exponentially increasing backoff, up to 2 minutes of wait time, for a maximum of 20 attempts (see the first sketch after this list)
- before I upload a file, I delete the target file in the storage bucket in case it already exists (see the second sketch after this list)
- I created the worker node pool on which pipeline execution takes place with full storage access scopes:
gcloud container node-pools create kfp-worker-pool \
--cluster=$KFP_CLUSTER_NAME \
--zone=$COMPUTE_ZONE \
--machine-type=$MACHINE_TYPE \
--scopes=cloud-platform,storage-full \
--num-nodes=$NUM_NODES
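For reference, the retry decorator looks roughly like this (a minimal sketch in plain Python; the name retry_with_backoff and the wrapped upload_file function are illustrative, not my exact code):

import time
from functools import wraps

from google.api_core.exceptions import Forbidden

def retry_with_backoff(max_attempts=20, max_backoff=120):
    # Retry the wrapped function on 403 Forbidden, doubling the wait each attempt
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            backoff = 1
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Forbidden:
                    if attempt == max_attempts:
                        raise
                    time.sleep(backoff)
                    backoff = min(backoff * 2, max_backoff)  # cap backoff at 2 minutes
        return wrapper
    return decorator

@retry_with_backoff()
def upload_file(gcs_bucket, gcs_target_path, src_file):
    gcs_bucket.blob(gcs_target_path).upload_from_filename(src_file, timeout=300)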
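And the delete-before-upload step is essentially this (again just a sketch, operating on the gcs_bucket from the snippet above):

def delete_if_exists(gcs_bucket, gcs_target_path):
    # Remove the target blob first so a stale object can't interfere with the upload
    blob = gcs_bucket.blob(gcs_target_path)
    if blob.exists():
        blob.delete()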
What I haven't tried yet, since I don't think it's very promising and I am running out of time:
- I haven't recreated the cluster as suggested here, since I re-create the node pool used for pipeline execution (not the default-pool, which is separate) every time before I start pipeline execution (also, there are no permission issues when I run pipelines in the fashion of one Pod per node at a time)
I'd greatly appreciate any solutions, or even ideas on how to investigate the issue further. Thanks for your help.