1

Hi I have setup horovod on a k8s cluster with 2 GPU nodes using spark-operator. I have executed the mnist example (https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/tutorials/training/spark/) using tensorflow, and it was executed successfully on both nodes (utilizing GPUs on both nodes). However when I am using KerasEstimator on spark, the training executes successfully but I think that only one gpu is getting used.

I am following this example: https://docs.databricks.com/_static/notebooks/deep-learning/horovod-spark-estimator-keras.html

here are the logs:

[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Bootstrap : Using eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/IB : No device found.
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.52.31<0>
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.11.4+cuda11.4
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Bootstrap : Using eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/IB : No device found.
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO NET/Socket : Using [0]eth0:10.84.179.52<0>
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Using network Socket
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01/02 : 0 1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [receive] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 1[4000] -> 0[2000] [receive] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 00 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Channel 01 : 0[2000] -> 1[4000] [send] via NET/Socket/0
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all rings
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Connected all trees
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO comm 0x7fd2247488e0 rank 0 nranks 2 cudaDev 0 busId 2000 - Init COMPLETE
[1,0]:fraud-engine-application-5422-6f5af3856318205f-exec-1:246:259 [0] NCCL INFO Launch mode Parallel
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all rings
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO Connected all trees
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
[1,1]:fraud-engine-application-5422-6f5af3856318205f-exec-2:1240:1253 [0] NCCL INFO comm 0x7fad647478a0 rank 1 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
[1,0]:
[1,1]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0086s vs on_train_batch_end time: 0.0658s). Check your callbacks.
[1,0]:WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.0053s vs on_train_batch_end time: 0.0687s). Check your callbacks.
1/4851 [..............................] - ETA: 8:35:39 - loss: 1.0356 - accuracy: 0.4844[1,0]: ����
9/4851 [..............................] - ETA: 30s - loss: 0.9629 - accuracy: 0.4219 [1,0]:
17/4851 [..............................] - ETA: 31s - loss: 0.9131 - accuracy: 0.4265[1,0]:
24/4851 [..............................] - ETA: 33s - loss: 0.8747 - accuracy: 0.4421[1,0]:
31/4851 [..............................] - ETA: 34s - loss: 0.8364 - accuracy: 0.4768[1,0]:
39/4851 [..............................] - ETA: 34s - loss: 0.7905 - accuracy: 0.5445[1,0]:
48/4851 [..............................] - ETA: 32s - loss: 0.7389 - accuracy: 0.6286[1,0]:
56/4851 [..............................] - ETA: 32s - loss: 0.6957 - accuracy: 0.6816[1,0]:
64/4851 [..............................] - ETA: 32s - loss: 0.6540 - accuracy: 0.7214[1,0]:
71/4851 [..............................] - ETA: 32s - loss: 0.6205 - accuracy: 0.7489[1,0]:
79/4851 [..............................] - ETA: 32s - loss: 0.5844 - accuracy: 0.7743[1,0]:
87/4851 [..............................] - ETA: 32s - loss: 0.5504 - accuracy: 0.7951[1,0]:
95/4851 [..............................] - ETA: 32s - loss: 0.5194 - accuracy: 0.8123[1,0]:
103/4851 [..............................] - ETA: 32s - loss: 0.4912 - accuracy: 0.8269[1,0]:
112/4851 [..............................] - ETA: 31s - loss: 0.4623 - accuracy: 0.8408[1,0]:
121/4851 [..............................] - ETA: 31s - loss: 0.4364 - accuracy: 0.8525[1,0]:
131/4851 [..............................] - ETA: 30s - loss: 0.4106 - accuracy: 0.8637[1,0]:
140/4851 [..............................] - ETA: 30s - loss: 0.3886 - accuracy: 0.8724[1,0]:
148/4851 [..............................] - ETA: 30s - loss: 0.3706 - accuracy: 0.8793[1,0]:
156/4851 [..............................] - ETA: 30s - loss: 0.3542 - accuracy: 0.8855[1,0]:
164/4851 [>.............................] - ETA: 30s - loss: 0.3388 - accuracy: 0.8911[1,0]:
172/4851 [>.............................] - ETA: 30s - loss: 0.3246 - accuracy: 0.8962[1,0]:
180/4851 [>.............................] - ETA: 30s - loss: 0.3116 - accuracy: 0.9008[1,0]:
188/4851 [>.............................] - ETA: 30s - loss: 0.2994 - accuracy: 0.9050[1,0]:
196/4851 [>.............................] - ETA: 30s - loss: 0.2882 - accuracy: 0.9089[1,0]:
204/4851 [>.............................] - ETA: 30s - loss: 0.2778 - accuracy: 0.9125[1,0]:
212/4851 [>.............................] - ETA: 30s - loss: 0.2680 - accuracy: 0.9158[1,0]:
220/4851 [>.............................] - ETA: 30s - loss: 0.2588 - accuracy: 0.9188[1,0]:
227/4851 [>.............................] - ETA: 30s - loss: 0.2513 - accuracy: 0.9213[1,0]:
235/4851 [>.............................] - ETA: 30s - loss: 0.2432 - accuracy: 0.9240[1,0]:
243/4851 [>.............................] - ETA: 30s - loss: 0.2356 - accuracy: 0.9265[1,0]:
251/4851 [>.............................] - ETA: 30s - loss: 0.2285 - accuracy: 0.9288[1,0]:
259/4851 [>.............................] - ETA: 30s - loss: 0.2218 - accuracy: 0.9310[1,0]:
267/4851 [>.............................] - ETA: 30s - loss: 0.2155 - accuracy: 0.9331[1,0]:
275/4851 [>.............................] - ETA: 30s - loss: 0.2095 - accuracy: 0.9351[1,0]:
283/4851 [>.............................] - ETA: 30s - loss: 0.2038 - accuracy: 0.9369[1,0]:
291/4851 [>.............................] - ETA: 30s - loss: 0.1985 - accuracy: 0.9386[1,0]:
299/4851 [>.............................] - ETA: 30s - loss: 0.1933 - accuracy: 0.9403[1,0]:
307/4851 [>.............................] - ETA: 30s - loss: 0.1885 - accuracy: 0.9418[1,0]:
316/4851 [>.............................] - ETA: 30s - loss: 0.1833 - accuracy: 0.9435[1,0]:
325/4851 [=>............................] - ETA: 30s - loss: 0.1784 - accuracy: 0.9450[1,0]:
334/4851 [=>............................] - ETA: 30s - loss: 0.1738 - accuracy: 0.9465[1,0]:
343/4851 [=>............................] - ETA: 30s - loss: 0.1694 - accuracy: 0.9479[1,0]:
351/4851 [=>............................] - ETA: 29s - loss: 0.1656 - accuracy: 0.9491[1,0]:
358/4851 [=>............................] - ETA: 30s - loss: 0.1625 - accuracy: 0.9501[1,0]:
366/4851 [=>............................] - ETA: 29s - loss: 0.1590 - accuracy: 0.9512[1,0]:
374/4851 [=>............................] - ETA: 29s - loss: 0.1557 - accuracy: 0.9522[1,0]:
383/4851 [=>............................] - ETA: 29s - loss: 0.1521 - accuracy: 0.9534[1,0]:
391/4851 [=>............................] - ETA: 29s - loss: 0.1491 - accuracy: 0.9543[1,0]:
400/4851 [=>............................] - ETA: 29s - loss: 0.1458 - accuracy: 0.9554[1,0]:
408/4851 [=>............................] - ETA: 29s - loss: 0.1430 - accuracy: 0.9562[1,0]:
417/4851 [=>............................] - ETA: 29s - loss: 0.1400 - accuracy: 0.9572[1,0]:
422/4851 [=>............................] - ETA: 29s - loss: 0.1384 - accuracy: 0.9577[1,0]:
428/4851 [=>............................] - ETA: 29s - loss: 0.1365 - accuracy: 0.9583[1,0]:
437/4851 [=>............................] - ETA: 29s - loss: 0.1338 - accuracy: 0.9591[1,0]:
447/4851 [=>............................] - ETA: 29s - loss: 0.1314 - accuracy: 0.9600[1,0]:
456/4851 [=>............................] - ETA: 29s - loss: 0.1289 - accuracy: 0.9608[1,0]:
465/4851 [=>............................] - ETA: 29s - loss: 0.1264 - accuracy: 0.9616[1,0]:
474/4851 [=>............................] - ETA: 29s - loss: 0.1241 - accuracy: 0.9623[1,0]:
483/4851 [=>............................] - ETA: 29s - loss: 0.1218 - accuracy: 0.9630[1,0]:
491/4851 [==>...........................] - ETA: 28s - loss: 0.1199 - accuracy: 0.9636[1,0]:
499/4851 [==>...........................] - ETA: 28s - loss: 0.1180 - accuracy: 0.9642[1,0]:
508/4851 [==>...........................] - ETA: 28s - loss: 0.1160 - accuracy: 0.9648[1,0]:
518/4851 [==>...........................] - ETA: 28s - loss: 0.1138 - accuracy: 0.9655[1,0]:
527/4851 [==>...........................] - ETA: 28s - loss: 0.1118 - accuracy: 0.9661[1,0]:
536/4851 [==>...........................] - ETA: 28s - loss: 0.1100 - accuracy: 0.9667[1,0]:
545/4851 [==>...........................] - ETA: 28s - loss: 0.1082 - accuracy: 0.9672[1,0]:
554/4851 [==>...........................] - ETA: 28s - loss: 0.1065 - accuracy: 0.9677[1,0]:
562/4851 [==>...........................] - ETA: 28s - loss: 0.1050 - accuracy: 0.9682[1,0]:
572/4851 [==>...........................] - ETA: 27s - loss: 0.1032 - accuracy: 0.9688[1,0]:

I have tried different spark and horovod configurations, however horovod is not utilizing 2nd node

pSoLT
  • 1,042
  • 9
  • 18
  • What makes you think that the second GPU is not used? NCCL successfully inits on two ranks and it launches Parallel mode, connecting them into ring. – pSoLT Jan 10 '23 at 09:20
  • Thanks for the reply, I monitored the GPU usage on both nodes, on one node data is being loaded and the training is running (judging from the GPU usage and memory utilization), however, on the 2nd node the GPU is sitting idle. – Obaid Ur Rehman Jan 10 '23 at 14:20
  • Can you share the code? – pSoLT Jan 12 '23 at 09:01

0 Answers0