
I am trying to train an ML model in a Cloud AI Platform training job via TFX + Kubeflow (the Pipelines service).

Whenever the Trainer job is triggered, I see log messages complaining about CUDA:

2021-02-14 23:39:45.470214: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib

I also don't see any GPU (accelerator) being used.
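To confirm whether the CUDA runtime named in that warning can be loaded at all inside the training container, a minimal sketch like the following could be run there. It assumes Python is available in the image; `cuda_runtime_loadable` is a hypothetical helper name, and the default library name is simply the one from the log line above.

```python
import ctypes
import os


def cuda_runtime_loadable(lib_name="libcudart.so.11.0"):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        # CDLL asks the system loader to open the library, searching the
        # usual locations plus any directories in LD_LIBRARY_PATH.
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False


if __name__ == "__main__":
    print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    print("libcudart loadable:", cuda_runtime_loadable())
```

If this reports that the library is not loadable, the image most likely ships without the CUDA runtime, regardless of which accelerator is attached to the machine.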

I think CUDA is available when I set scaleTier to something like BASIC_GPU. However, I also need TFX's entrypoint, and I have not seen any official Dockerfile for building a TFX + CUDA image.

Any suggestions? With all the trial and error, I keep burning through GCP credits.
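For reference, the kind of training configuration in question might look like the sketch below. The field names follow the AI Platform Training `TrainingInput` API as far as I understand it; the project ID, region, machine type, accelerator type, and image URI are hypothetical placeholders, and the image would still need to contain both the CUDA libraries and TFX's entrypoint.

```python
# Hypothetical values throughout: replace the project, region, machine type,
# accelerator type, and image URI with ones that exist in your environment.
ai_platform_training_args = {
    "project": "my-gcp-project",
    "region": "us-central1",
    # CUSTOM lets the machine type and accelerator be specified explicitly,
    # as an alternative to a preset tier such as BASIC_GPU.
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-4",
    "masterConfig": {
        # Assumed custom image that bundles CUDA together with TFX.
        "imageUri": "gcr.io/my-gcp-project/tfx-gpu:latest",
        "acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"},
    },
}
```

A dict of this shape is what gets handed to the TFX AI Platform Trainer through its custom config; the exact key it is passed under depends on the TFX version in use.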

박찬성
  • You should try existing Deep Learning Containers. https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container – gogasca Feb 18 '21 at 09:35
  • I found this [repo](https://github.com/valeriano-manassero/tfx-nvidia-gpu), which may help with using `TFX with Kubeflow` when the underlying Kubernetes cluster has GPU-enabled nodes. You can also check out `Vertex AI`, which can run TFX pipelines with GPUs; this [page](https://www.tensorflow.org/tfx/tutorials/tfx/gcp/vertex_pipelines_simple) is a quick starting point for that. Thanks! –  Jan 28 '22 at 17:53

0 Answers