Highest Voted 'horovod' Questions

10

votes

2 answers

Distribute data from `tf.data.Dataset` to multiple workers (e.g. for Horovod)

With Horovod, you basically run N independent instances (so it is a form of between-graph replication), and they communicate via special Horovod ops (basically broadcast + reduce). Now let's say either instance 0, or some other external instance…

asked May 23 '20 at 17:18

Albert
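The excerpt above describes Horovod as N independent instances that coordinate through broadcast and reduce ops. As a rough illustration only, the plain-Python sketch below simulates what those two collectives compute, with one list entry per rank. It does not call Horovod, and the helper names (`broadcast`, `allreduce_mean`, `shard_indices`) are hypothetical. The last helper shows per-rank sharding, which follows the same element-selection rule as `tf.data.Dataset.shard`; this may not be the distribution scheme the truncated question goes on to ask about.

```python
from typing import List


def broadcast(values: List[float], root: int) -> List[float]:
    """Simulated broadcast: every rank ends up with the root rank's value."""
    return [values[root]] * len(values)


def allreduce_mean(values: List[float]) -> List[float]:
    """Simulated allreduce with averaging: every rank gets the mean of all ranks."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def shard_indices(num_examples: int, num_workers: int, rank: int) -> List[int]:
    """Indices a given rank would read if every rank takes each num_workers-th example."""
    return list(range(rank, num_examples, num_workers))


# One list entry per simulated rank (4 ranks here).
initial_weights = [0.5, -1.0, 2.0, 0.0]
synced = broadcast(initial_weights, root=0)   # all ranks start from rank 0's value

local_grads = [1.0, 2.0, 3.0, 6.0]
averaged = allreduce_mean(local_grads)        # all ranks apply the same averaged gradient

print(synced)
print(averaged)
print(shard_indices(10, 4, 1))
```

In a real job each rank is a separate process and holds only its own value; the lists here just make all ranks visible at once.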


9

votes

2 answers

How to check the version of NCCL

I access high-performance computing nodes remotely. I am not sure whether the NVIDIA Collective Communications Library (NCCL) is installed in my directory. Is there any way to check whether NCCL is installed?

python tensorflow nvidia horovod

asked Apr 07 '21 at 11:05

Ahmad
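One possible way to answer this from Python, sketched under assumptions: NCCL exposes a C function `ncclGetVersion(int *version)`, which can be called through `ctypes` if the shared library can be located. The helper names below (`decode_nccl_version`, `find_nccl_version`) are hypothetical, and `ctypes.util.find_library` may miss copies of NCCL installed in non-standard locations, so a `None` result does not prove NCCL is absent.

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Split NCCL's integer version code into (major, minor, patch).

    Releases up to 2.8.x encode as major*1000 + minor*100 + patch;
    later releases encode as major*10000 + minor*100 + patch.
    """
    if code < 10000:
        return code // 1000, (code % 1000) // 100, code % 100
    return code // 10000, (code % 10000) // 100, code % 100


def find_nccl_version() -> Optional[Tuple[int, int, int]]:
    """Return the NCCL version if the library can be found and loaded, else None."""
    name = ctypes.util.find_library("nccl")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        version = ctypes.c_int(0)
        # ncclGetVersion returns an ncclResult_t; 0 means success.
        status = lib.ncclGetVersion(ctypes.byref(version))
    except (OSError, AttributeError):
        return None
    if status != 0:
        return None
    return decode_nccl_version(version.value)


print(find_nccl_version())
```

If the library lives somewhere the loader does not search, passing its full path to `ctypes.CDLL` directly is an alternative.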


7

votes

2 answers

Tensorflow Mirror Strategy and Horovod Distribution Strategy

I am trying to understand the basic differences between Tensorflow Mirror Strategy and Horovod Distribution Strategy. From the documentation and from investigating the source code, I found that Horovod (https://github.com/horovod/horovod) is using…

tensorflow deep-learning mpi distributed-tensorflow horovod

asked Mar 05 '19 at 17:15

Md Kamruzzaman Sarker


4

votes

0 answers

ValueError: Items of feature_columns must be a _FeatureColumn. (Tensorflow 1.13)

I'm running into a ValueError when running Tensorflow-1.13 + Horovod-0.16 + Spark-0.24 + Petastorm-0.17. It's a straightforward implementation of a model_fn and some indicator_columns, but is throwing an error similar to Items of feature_columns…

apache-spark tensorflow tensorflow-estimator horovod petastorm

asked May 16 '19 at 21:52

Gan


3

votes

0 answers

Is it possible to use Open MPI in Docker with the default bridge network and host port forwarding?

I am trying to use Open MPI in Docker with containers on different hosts but connected to their respective default Docker bridge networks. There is a range of TCP ports that are mapped from the Docker host to the container. mpirun allows you to…

docker mpi openmpi horovod

asked Oct 24 '19 at 11:46

eigenstate47


3

votes

1 answer

ImportError: Extension horovod.tensorflow has not been built

I keep getting this error even though I have reinstalled horovod and tensorflow multiple times. Please help! Traceback (most recent call last): File "train.py", line 3, in <module> import horovod.tensorflow as hvd File…

python-3.x machine-learning horovod

asked May 24 '19 at 21:05

Tavishi Gupta


2

votes

0 answers

Horovod and Tensorflow ambiguous error (train_on_batch)

I'm trying to run distributed training with tensorflow.keras and horovod using a custom training loop (train_on_batch) with nvidia-docker on AWS p2.8xlarge. My code is a mess so posting it wouldn't be too useful. Here is a link to the output which…

python tensorflow keras tensorflow2.0 horovod

asked May 12 '21 at 13:02

Andrei.Mouraviev


2

votes

1 answer

A simple distributed training python program for deep learning models by Horovod on GPU cluster

I am trying to run some example python3 code from https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html on a Databricks GPU cluster (with 1 driver and 2 workers). Databricks environment: ML 6.6, scala 2.11, Spark…

deep-learning gpu databricks horovod distributed-training

asked Jul 11 '20 at 21:15

user3448011


2

votes

1 answer

How to resume from a checkpoint when using Horovod with tf.keras?

Note: I'm using TF 2.1.0 and the tf.keras API. I've experienced the below issue with all Horovod versions between 0.18 and 0.19.2. Are we supposed to call hvd.load_model() on all ranks when resuming from a tf.keras h5 checkpoint, or are we only…

python tensorflow tensorflow2.0 tf.keras horovod

asked May 19 '20 at 17:13

user1414202


2

votes

0 answers

How to run Tensorflow - Spark jobs on Kubernetes using the Spark Operator?

My team is looking for a way to run Spark jobs that use the Tensorflow library on Kubernetes. We use the Spark Operator to run Spark on Kubernetes idiomatically. How should I go about creating a pod with the Spark job (PySpark + TF) and have it work…

apache-spark tensorflow kubernetes horovod

asked Jul 23 '19 at 16:30

Ramya Raj


2

votes

1 answer

How to fix : horovod.run.common.util.network.NoValidAddressesFound

I'm trying to run distributed training with 2 nvidia-docker containers. When I tried with 2 hosts it did not work. How do I fix this problem? I tried this command: `horovodrun -np 3 -H localhost:1 -p 12345 python keras_mnist_advanced.py`. It worked, but when I…

python deep-learning nvidia horovod

asked Mar 30 '19 at 00:33

plusultra


1

vote

0 answers

Horovod torch estimator prepare_batch error

I'm trying to make a horovod torch estimator for a spark pipeline, but I'm getting an error while trying to fit the data and I don't know/understand the cause. I've left the full stack error here, but the final trace is as…

apache-spark pyspark pytorch horovod

asked Jan 16 '23 at 15:20

Voxeldoodle


1

vote

0 answers

Horovod Unable to use 2nd GPU worker

I have set up Horovod on a k8s cluster with 2 GPU nodes using spark-operator. I ran the mnist example (https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/tutorials/training/spark/) using tensorflow, and it executed successfully on both…

tensorflow apache-spark horovod

asked Jan 05 '23 at 17:03

Obaid Ur Rehman


1

vote

0 answers

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

I am trying to use multiple GPUs with Horovod for distributed training. Initially, I tested a simple convolutional neural network on a single GPU and on two GPUs, and everything worked properly. Then, I used a CNN and an LSTM in combination. It…

python tensorflow distributed-training horovod

asked Aug 14 '22 at 11:32

Ahmad


1

vote

1 answer

Building Azure Machine Learning environment (tensorflow) from dockerfile failing

I'm trying to create a new environment based on the TF 2.4 curated environment with opencv. Support for opencv is the only difference. I modified the dockerfile to include opencv as follows: FROM…

azure tensorflow opencv azure-machine-learning-service horovod

asked Oct 12 '21 at 01:45

user2333716

