Questions tagged [horovod]
42 questions
10
votes
2 answers
Distribute data from `tf.data.Dataset` to multiple workers (e.g. for Horovod)
With Horovod, you run N independent instances (a form of between-graph replication), and they communicate via special Horovod ops (essentially broadcast and reduce).
Now let's say either instance 0, or some other external instance…

Albert
- 65,406
- 61
- 242
- 386
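One common way to split input across independent Horovod instances is to let each rank keep every N-th element, which is the rule `tf.data.Dataset.shard(num_shards, index)` applies. The sketch below imitates that rule in plain Python so it runs without TensorFlow or Horovod. `shard_for_rank` is a hypothetical helper, not part of either library; in a real job the worker count and rank would presumably come from `hvd.size()` and `hvd.rank()`. It does not cover the case where a single external instance produces the data and feeds the others.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Iterable[T], num_workers: int, rank: int) -> Iterator[T]:
    """Yield the elements whose position modulo num_workers equals rank.

    Hypothetical helper that mirrors the selection rule of
    tf.data.Dataset.shard(num_shards, index) on a plain Python iterable.
    """
    if num_workers < 1 or not 0 <= rank < num_workers:
        raise ValueError("rank must satisfy 0 <= rank < num_workers")
    for position, item in enumerate(items):
        if position % num_workers == rank:
            yield item


if __name__ == "__main__":
    # Assumption: 3 workers; in a Horovod job these values would come from
    # hvd.size() and hvd.rank() after hvd.init().
    for rank in range(3):
        print(rank, list(shard_for_rank(range(10), 3, rank)))
```

Every element lands on exactly one rank, so the workers see disjoint slices whose union is the full dataset. Shard sizes can differ by one element when the dataset length is not a multiple of the worker count.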
9
votes
2 answers
How to check the version of NCCL
I remotely access high-performance computing nodes. I am not sure whether the NVIDIA Collective Communications Library (NCCL) is installed in my directory. Is there any way to check whether NCCL is installed?

Ahmad
- 645
- 2
- 6
- 21
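One possible check, sketched under assumptions: NCCL exposes a C function `ncclGetVersion(int *version)`, which can be called from Python through `ctypes` if the shared library is on the standard linker path. The helper names `decode_nccl_version` and `find_nccl_version` are hypothetical. The decoding assumes the integer layout used by NCCL's `NCCL_VERSION` macro (`major*1000 + minor*100 + patch` up to 2.8, `major*10000 + minor*100 + patch` afterwards). A copy of NCCL bundled inside a conda environment or a Python wheel will usually not be found this way.

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Split NCCL's integer version code into (major, minor, patch).

    Assumes the NCCL_VERSION macro layout: four-digit codes up to 2.8.x,
    five-digit codes from 2.9 onwards.
    """
    if code >= 10000:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100


def find_nccl_version() -> Optional[Tuple[int, int, int]]:
    """Return the NCCL version found on the linker path, or None."""
    name = ctypes.util.find_library("nccl")
    if name is None:
        return None  # not on the standard search path
    try:
        lib = ctypes.CDLL(name)
        version = ctypes.c_int(0)
        status = lib.ncclGetVersion(ctypes.byref(version))
    except (OSError, AttributeError):
        return None  # library failed to load or lacks the symbol
    if status != 0:  # 0 is ncclSuccess
        return None
    return decode_nccl_version(version.value)


if __name__ == "__main__":
    print(find_nccl_version())
```

A `None` result only means the library was not found on the default search path, not that NCCL is absent from the machine. If PyTorch is installed, `torch.cuda.nccl.version()` reports the NCCL version PyTorch itself uses.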
7
votes
2 answers
Tensorflow Mirror Strategy and Horovod Distribution Strategy
I am trying to understand the basic differences between Tensorflow Mirror Strategy and Horovod Distribution Strategy.
From the documentation and the source code investigation I found that Horovod (https://github.com/horovod/horovod) is using…

Md Kamruzzaman Sarker
- 2,387
- 3
- 22
- 38
4
votes
0 answers
ValueError: Items of feature_columns must be a _FeatureColumn. (Tensorflow 1.13)
I'm running into a ValueError when running Tensorflow-1.13 + Horovod-0.16 + Spark-0.24 + Petastorm-0.17. It's a straightforward implementation of a model_fn and some indicator_columns, but is throwing an error similar to Items of feature_columns…

Gan
- 41
- 2
3
votes
0 answers
Is it possible to use Open MPI in Docker with the default bridge network and host port forwarding?
I am trying to use Open MPI in Docker with containers on different hosts but connected to their respective default Docker bridge networks. There is a range of TCP ports that are mapped from the Docker host to the container.
mpirun allows you to…

eigenstate47
- 31
- 1
3
votes
1 answer
ImportError: Extension horovod.tensorflow has not been built
I keep getting this error even though I have reinstalled horovod and tensorflow multiple times. Please help!
Traceback (most recent call last):
  File "train.py", line 3, in <module>
    import horovod.tensorflow as hvd
  File…

Tavishi Gupta
- 41
- 3
2
votes
0 answers
Horovod and Tensorflow ambiguous error (train_on_batch)
I'm trying to run distributed training with tensorflow.keras and horovod using a custom training loop (train_on_batch) with nvidia-docker on AWS p2.8xlarge. My code is a mess so posting it wouldn't be too useful. Here is a link to the output which…

Andrei.Mouraviev
- 113
- 2
- 9
2
votes
1 answer
A simple distributed training python program for deep learning models by Horovod on GPU cluster
I am trying to run the example Python 3 code from
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on a Databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark…

user3448011
- 1,469
- 1
- 17
- 39
2
votes
1 answer
How to resume from a checkpoint when using Horovod with tf.keras?
Note: I'm using TF 2.1.0 and the tf.keras API. I've experienced the issue below with all Horovod versions between 0.18 and 0.19.2.
Are we supposed to call hvd.load_model() on all ranks when resuming from a tf.keras h5 checkpoint, or are we only…

user1414202
- 440
- 1
- 5
- 20
2
votes
0 answers
How to run Tensorflow - Spark jobs on Kubernetes using the Spark Operator?
My team is looking for a way to run Spark jobs that use the Tensorflow library on Kubernetes. We use the Spark Operator to run Spark on Kubernetes idiomatically.
How should I go about creating a pod with the Spark job (PySpark + TF) and have it work…

Ramya Raj
- 109
- 1
- 10
2
votes
1 answer
How to fix : horovod.run.common.util.network.NoValidAddressesFound
I'm trying to run distributed training with 2 nvidia-docker containers. When I tried with 2 hosts, it did not work. How do I fix this problem?
I tried this command:
horovodrun -np 3 -H localhost:1 -p 12345 python keras_mnist_advanced.py
It worked, but when I…

plusultra
- 41
- 4
1
vote
0 answers
Horovod torch estimator prepare_batch error
I'm trying to make a horovod torch estimator for a spark pipeline, but I'm getting an error while trying to fit the data and I don't know/understand the cause.
I've left the full stack error here, but the final trace is as…

Voxeldoodle
- 73
- 7
1
vote
0 answers
Horovod Unable to use 2nd GPU worker
I have set up Horovod on a k8s cluster with 2 GPU nodes using spark-operator. I ran the mnist example (https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/tutorials/training/spark/) using tensorflow, and it executed successfully on both…

Obaid Ur Rehman
- 23
- 4
1
vote
0 answers
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
I am trying to use multiple GPUs with Horovod for distributed training. Initially, I tested a simple convolutional neural network on a single GPU and on two GPUs, and everything worked properly. Then, I used a CNN and an LSTM in combination. It…

Ahmad
- 645
- 2
- 6
- 21
1
vote
1 answer
Building Azure Machine Learning environment (tensorflow) from dockerfile failing
I'm trying to create a new environment based on the TF 2.4 curated environment, with opencv support as the only difference. I modified the dockerfile to include opencv as follows:
FROM…

user2333716
- 43
- 5