I am trying to train an ML model as a Cloud AI Platform Training job, launched via TFX + Kubeflow (the Pipelines service).
Whenever the Trainer job is triggered, I see log messages complaining about CUDA:
2021-02-14 23:39:45.470214: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
I also don't see any GPU (accelerator) being used.
I think CUDA is available when I set scaleTier to something like BASIC_GPU. However, I also need TFX's EntryPoint, and I have not found an official Dockerfile for building a TFX+CUDA image.
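
To make the setup concrete, here is a minimal sketch of the kind of training arguments in question. It assumes the dict follows the shape of AI Platform Training's job input (scaleTier, masterConfig.imageUri) with project and region alongside. The helper name build_training_args, the project ID, the region, and the image URI are hypothetical placeholders, not values from my actual pipeline.

```python
def build_training_args(project, region, image_uri, scale_tier="BASIC_GPU"):
    """Build a dict shaped like AI Platform Training job arguments.

    Hypothetical helper for illustration only. `image_uri` would point at a
    custom container that bundles both TFX and the CUDA libraries.
    """
    return {
        "project": project,
        "region": region,
        # BASIC_GPU requests a single worker with one GPU attached.
        "scaleTier": scale_tier,
        # Custom container used for the master (training) node.
        "masterConfig": {"imageUri": image_uri},
    }


# Placeholder values; replace with real project, region, and image.
training_args = build_training_args(
    project="my-gcp-project",
    region="us-central1",
    image_uri="gcr.io/my-gcp-project/tfx-cuda:latest",
)
print(training_args)
```

The open question is what goes into that image so that TFX's entrypoint and libcudart.so.11.0 are both present.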
Any suggestions? With trial and error, I keep burning through GCP credits...