My team and I are setting up a pipeline in GCP, and we are trying to learn by working through the notebook tutorial at https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines. However, when we reach the step that creates the pipeline, the error below shows up. Please help!

We ran:

    !tfx pipeline create \
      --pipeline-path=kubeflow_dag_runner.py \
      --endpoint={ENDPOINT} \
      --build-target-image={CUSTOM_TFX_IMAGE}

And we got:

2021-02-09 08:21:49.170213: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2021-02-09 08:21:49.170263: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
CLI
Creating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Target image gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline is not used. If the build spec is provided, update the target image in the build spec file build.yaml.
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline -> gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline: Not found. Building
[Skaffold] Building [gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline]...
[Skaffold] Sending build context to Docker daemon  2.056MB
[Skaffold] Step 1/4 : FROM tensorflow/tfx:0.26.1
[Skaffold]  ---> 6dd91a0791af
[Skaffold] Step 2/4 : WORKDIR /pipeline
[Skaffold]  ---> Using cache
[Skaffold]  ---> 7882f4facc06
[Skaffold] Step 3/4 : COPY ./ ./
[Skaffold]  ---> 2dbfe44eb3f1
[Skaffold] Step 4/4 : ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
[Skaffold]  ---> Running in b6bbdb97a2df
[Skaffold] Removing intermediate container b6bbdb97a2df
[Skaffold]  ---> d7d56f13fe6d
[Skaffold] Successfully built d7d56f13fe6d
[Skaffold] Successfully tagged gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline:latest
[Skaffold] The push refers to repository [gcr.io/ts-ntnu-v2021-stm-ep3j/tfx-pipeline]
[Skaffold] 06e11ce4eea3: Preparing
[Skaffold] ab1902317977: Preparing
[Skaffold] 1a67ae26cf47: Preparing
[Skaffold] 25e69afdb83b: Preparing
[Skaffold] 2bd41d6594e3: Preparing
[Skaffold] 8e486d328b86: Preparing
[Skaffold] 8f42d0a1a747: Preparing
[Skaffold] 4058ae03fa32: Preparing
[Skaffold] e3437c61d457: Preparing
[Skaffold] 84ff92691f90: Preparing
[Skaffold] 54b00d861a7a: Preparing
[Skaffold] c547358928ab: Preparing
[Skaffold] 84ff92691f90: Preparing
[Skaffold] c4e66be694ce: Preparing
[Skaffold] 47cc65c6dd57: Preparing
[Skaffold] 8e486d328b86: Waiting
[Skaffold] 8f42d0a1a747: Waiting
[Skaffold] 4058ae03fa32: Waiting
[Skaffold] e3437c61d457: Waiting
[Skaffold] 84ff92691f90: Waiting
[Skaffold] 54b00d861a7a: Waiting
[Skaffold] c547358928ab: Waiting
[Skaffold] 47cc65c6dd57: Waiting
[Skaffold] c4e66be694ce: Waiting
[Skaffold] Build Failed. No push access to specified image repository. Trying running with `--default-repo` flag.
No container image is built.
Traceback (most recent call last):
  File "/opt/conda/bin/tfx", line 10, in <module>
    sys.exit(cli_group())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/decorators.py", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/commands/pipeline.py", line 117, in create_pipeline
    handler_factory.create_handler(ctx.flags_dict).create_pipeline()
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/handler/kubeflow_handler.py", line 75, in create_pipeline
    skaffold_cmd)
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/handler/kubeflow_handler.py", line 291, in _build_pipeline_image
    skaffold_cmd=skaffold_cmd).build()
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/container_builder/builder.py", line 92, in build
    image_sha = skaffold_cli.build(self._buildspec)
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/container_builder/skaffold_cli.py", line 61, in build
    spec.filename))

RuntimeError: skaffold failed to build an image with build.yaml.

-------- UPDATE ---------

I've found this in my logs if this helps:

[screenshot of the log entry; image not available]

  • Have you tried it with the recommended additional flag from the error message? "Trying running with `--default-repo` flag." Did that make a difference? This looks like a simple permissions error. Are you using one of the GCP AI Platform VMs? Can you confirm (through a notebook or similar) that it otherwise has access to the intended repository? Judging from the error log, you probably need to create the destination bucket manually, through the GCP UI. – Sarah Messer Feb 09 '21 at 19:05
  • Thank you for responding! Yes, we tried it, and it didn't make a difference. The notebook is running on a GCP instance, and the account should have all permissions on Storage. I created the bucket manually, with all permissions, and put its path in the config.py file (as instructed by TensorFlow), but I still get the same error message. The account should have all permissions in the project environment as well. – gruppe3 Feb 09 '21 at 21:12
  • Can you use the notebook to read & write test files at the target location via the Python google-cloud-storage library? https://cloud.google.com/storage/docs/reference/libraries#client-libraries-usage-python – Sarah Messer Feb 09 '21 at 21:48
  • @gruppe3, have you verified that the service account from the attached screenshot has been granted the [permissions](https://cloud.google.com/container-registry/docs/access-control#permissions_and_roles) needed to push images to the container registry? – Nick_Kh Feb 10 '21 at 08:14
  • We created a service account with the Storage Admin role, so permissions should not be a problem. We have not tried that yet, @SarahMesser. Do you know how? – gruppe3 Feb 10 '21 at 13:47
  • That error is from Docker (used by Skaffold as the builder). Could you please confirm that you performed step 2.4 to add the `.../cloud-platform` scope? https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines#2_set_up_and_deploy_an_ai_platform_pipeline_on_a_new_kubernetes_cluster – Brian de Alwis Feb 10 '21 at 16:47
  • Yes, we have done it several times as well just to be sure. Could it be something else? – gruppe3 Feb 11 '21 at 17:24
  • How did you specify the `--default-repo` flag prior to building the target image? – Nick_Kh Feb 12 '21 at 13:24
  • After creating an account with the Storage Admin role, are you still getting: "[Skaffold] Build Failed. No push access to specified image repository." ? – RCrowe Mar 29 '21 at 17:26
  • I found a similar issue in the `skaffold` repo (for an Azure container repo), where it is mentioned that it should already be fixed in `V 1.22.0`. You can take a look at this [link](https://github.com/GoogleContainerTools/skaffold/issues/5601#issuecomment-837357747) for reference. Please try again with the latest version and let us know if the issue still persists. Thanks! –  Mar 21 '22 at 17:33
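
For the read/write test suggested in the comments above, here is a minimal sketch, assuming the `google-cloud-storage` package is installed in the notebook environment and the notebook's default credentials are the ones the pipeline would use. The function names, the `tfx-access-check` prefix, and the bucket name in the usage note are hypothetical placeholders, not part of the tutorial. The check writes a small object, reads it back, and deletes it.

```python
import uuid


def check_bucket_read_write(bucket, prefix="tfx-access-check"):
    """Write a small object to `bucket`, read it back, then delete it.

    `bucket` is expected to behave like a google.cloud.storage.Bucket.
    Returns True if the downloaded content matches what was uploaded.
    Permission problems surface as exceptions from the storage calls.
    """
    payload = "tfx access check"
    # A random object name avoids clobbering anything already in the bucket.
    blob = bucket.blob(f"{prefix}/{uuid.uuid4().hex}.txt")
    blob.upload_from_string(payload)
    try:
        return blob.download_as_bytes().decode("utf-8") == payload
    finally:
        # Clean up the test object even if the read fails.
        blob.delete()


def check_real_bucket(bucket_name):
    """Run the check against a real bucket using default credentials."""
    # Imported lazily so the helper above can be used without the package.
    from google.cloud import storage

    client = storage.Client()
    return check_bucket_read_write(client.bucket(bucket_name))
```

In a notebook cell this could be called as `check_real_bucket("your-bucket-name")`, with the placeholder replaced by the bucket path set in config.py. A successful round trip only shows that the notebook can read and write that bucket; it does not by itself prove that Docker is authorized to push to `gcr.io`.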
