
My tfjs-node-gpu code works great on an NVIDIA P4 on GKE (and with WebGL in the browser), but it fails on a V100 and a T4.

Node crashes in the first predict() call inside my warmup. I'm feeding small 128x128 tiles to the idealo GANs to produce a 4x image upscale. The V100 initializes fine: it shows up in nvidia-smi, is listed as a TF device, and the NUMA messages are all normal. The call then hard-crashes my Node/Express server. I'm having trouble finding the crash stack, since the server is started in a Docker container and my last attempt to log the crash from stderr failed.
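One way to get at the missing crash stack is to capture a native core dump from the container, since a segfault inside libtensorflow never reaches JavaScript-level error handlers or stderr logging set up in JS. A rough sketch (the image name and paths are placeholders, not from the question):

```shell
# Run attached so native stderr reaches the terminal, and allow
# unlimited-size core dumps inside the container:
docker run --gpus all --ulimit core=-1 my-upscaler-image 2>&1 | tee crash.log

# After a crash, the host kernel's core_pattern controls where the dump
# lands; open it with gdb to get the native stack out of libtensorflow:
#   gdb $(which node) /path/to/core -ex bt
```

This is a command-line sketch under those assumptions, not a verified recipe; the point is that a native backtrace, unlike JS logging, will show which CUDA/cuDNN call actually faulted.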

I've tried both the latest tfjs-node-gpu 3.0 and 2.8.5. GKE is configured to install the NVIDIA drivers, currently 410.104, with CUDA 10.0.

I've tried enabling debug mode and passing {verbose: true} to the failing model.predict() call in my warmup function. Neither adds any output to the warmup call, which is odd, since I do see output from the actual, non-warmup call to model.predict().
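Since a hard native crash can swallow buffered console output before it is flushed, one low-tech way to confirm the process dies inside (rather than before) the warmup predict is to bracket the call with direct stderr writes. A minimal sketch in plain Node; `model` and the 1x128x128x3 input shape are assumptions based on the question:

```javascript
// Wrap a suspect call so the ">> entering" marker is written to stderr
// immediately; if the process dies inside fn(), that marker is the last
// thing seen, while the "<< left" marker never appears.
function loggedCall(label, fn) {
  process.stderr.write(`>> entering ${label}\n`);
  const result = fn();
  process.stderr.write(`<< left ${label}\n`);
  return result;
}

// Hypothetical usage inside the warmup, assuming a loaded tf.LayersModel:
// const out = loggedCall('warmup predict', () =>
//   model.predict(tf.zeros([1, 128, 128, 3])));
```

If ">> entering warmup predict" appears but nothing after it, the fault is inside the native predict path rather than in the surrounding JS, which narrows it to the CUDA/cuDNN side.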

Any suggestions on how to debug further?

Daniel Wexler
  • Could it be somehow related to the version of Node? – edkeveked Feb 11 '21 at 08:36
  • Could you check whether your nodes meet the GKE [requirements](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#requirements) for GPUs? [Here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) you can find the driver to install depending on your OS and GKE version. – Dylan Feb 17 '21 at 15:22
  • Yes drivers have been verified to those versions as demonstrated by the working P4 case. – Daniel Wexler Feb 17 '21 at 17:17
  • As your case seems very specific to a GCP product it might suit better opening a Google [Issue Tracker](https://cloud.google.com/support/docs/issue-trackers) issue. – Dylan Feb 18 '21 at 15:35
  • Google requests that support questions be posted on Stack Overflow using the proper tags. – Daniel Wexler Feb 18 '21 at 22:55
  • Tracking [tfjs-node-gpu issue](https://github.com/tensorflow/tfjs/issues/4193) and its [merge request](https://github.com/tensorflow/tfjs/pull/4810) to update to the TF 2.4 core for backend support. I'm fairly sure this is a driver version issue. In my case, I think the GKE-installed Daemonset is installing a newer driver on my VM, while my container is using CUDA 10.1 and cudnn7, as required by tfjs-node-gpu. – Daniel Wexler Mar 22 '21 at 19:02

0 Answers