Questions tagged [horovod]

42 questions
10
votes
2 answers

Distribute data from `tf.data.Dataset` to multiple workers (e.g. for Horovod)

With Horovod, you basically run N independent instances (so it is a form of between-graph replication), and they communicate via special Horovod ops (basically broadcast + reduce). Now let's say either instance 0, or some other external instance…
Albert
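
The model described in the excerpt above (N independent instances that synchronise through a broadcast and a reduce) can be sketched without Horovod at all. The following is a pure-Python simulation under stated assumptions: `broadcast`, `allreduce_mean`, and `shard` are hypothetical helper names, not Horovod functions, and the workers are plain list entries rather than separate processes. In real Horovod code the rank and worker count would come from `hvd.rank()` and `hvd.size()`, and `tf.data.Dataset.shard(num_shards, index)` is one common way to give each worker a disjoint slice of a dataset.

```python
from typing import List, Sequence


def broadcast(values: Sequence[float], root: int = 0) -> List[float]:
    """Every worker ends up with the value held by the root worker."""
    return [values[root]] * len(values)


def allreduce_mean(values: Sequence[float]) -> List[float]:
    """Every worker ends up with the mean of all workers' values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def shard(items: Sequence[int], size: int, rank: int) -> List[int]:
    """Worker `rank` keeps every `size`-th item, starting at index `rank`."""
    return list(items[rank::size])


size = 4  # assumed number of workers, for illustration only

# Each list position stands for one worker's local value.
initial_weights = [0.5, 9.0, 9.0, 9.0]
synced_weights = broadcast(initial_weights, root=0)

local_gradients = [1.0, 2.0, 3.0, 6.0]
averaged_gradients = allreduce_mean(local_gradients)

# Disjoint slices of a 10-element dataset, one per worker.
shards = [shard(range(10), size, rank) for rank in range(size)]
```

Sharding by rank avoids routing data through a single instance, at the cost of every worker needing access to the full input source.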
9
votes
2 answers

How to check the version of NCCL

I remotely access high-performance computing nodes. I am not sure whether the NVIDIA Collective Communications Library (NCCL) is installed in my directory. Is there any way to check whether NCCL is installed?
Ahmad
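
One way to approach the question above, sketched under assumptions: NCCL's header `nccl.h` defines the macros `NCCL_MAJOR`, `NCCL_MINOR`, and `NCCL_PATCH`, so reading them gives the version of the installed headers. The helper name `parse_nccl_version` and the sample header text are hypothetical, and where `nccl.h` lives depends on the installation. If PyTorch is available, `torch.cuda.nccl.version()` reports the NCCL version PyTorch was built with, which may differ from a system-wide install.

```python
import re
from typing import Optional, Tuple


def parse_nccl_version(header_text: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the text of nccl.h, or None if absent."""
    parts = []
    for name in ("NCCL_MAJOR", "NCCL_MINOR", "NCCL_PATCH"):
        match = re.search(
            rf"^\s*#\s*define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None  # macro missing: probably not an NCCL header
        parts.append(int(match.group(1)))
    return (parts[0], parts[1], parts[2])


# Made-up header fragment used only to exercise the parser.
sample_header = """
#define NCCL_MAJOR 2
#define NCCL_MINOR 7
#define NCCL_PATCH 8
"""
version = parse_nccl_version(sample_header)
print(version)
```

To apply it to a real installation, pass in the contents of your own `nccl.h`; a `None` result means the three macros were not found in that file.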
7
votes
2 answers

Tensorflow Mirror Strategy and Horovod Distribution Strategy

I am trying to understand the basic differences between Tensorflow Mirror Strategy and Horovod Distribution Strategy. From the documentation and from investigating the source code, I found that Horovod (https://github.com/horovod/horovod) is using…
4
votes
0 answers

ValueError: Items of feature_columns must be a _FeatureColumn. (Tensorflow 1.13)

I'm running into a ValueError when running Tensorflow-1.13 + Horovod-0.16 + Spark-0.24 + Petastorm-0.17. It's a straightforward implementation of a model_fn and some indicator_columns, but is throwing an error similar to Items of feature_columns…
3
votes
0 answers

Is it possible to use Open MPI in Docker with the default bridge network and host port forwarding?

I am trying to use Open MPI in Docker with containers on different hosts but connected to their respective default Docker bridge networks. There is a range of TCP ports that are mapped from the Docker host to the container. mpirun allows you to…
3
votes
1 answer

ImportError: Extension horovod.tensorflow has not been built

I keep getting this error even though I have reinstalled horovod and tensorflow multiple times. Please help! Traceback (most recent call last): File "train.py", line 3, in <module> import horovod.tensorflow as hvd File…
2
votes
0 answers

Horovod and Tensorflow ambiguous error (train_on_batch)

I'm trying to run distributed training with tensorflow.keras and horovod using a custom training loop (train_on_batch) with nvidia-docker on AWS p2.8xlarge. My code is a mess so posting it wouldn't be too useful. Here is a link to the output which…
2
votes
1 answer

A simple distributed training python program for deep learning models by Horovod on GPU cluster

I am trying to run some example Python 3 code from https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html on a Databricks GPU cluster (with 1 driver and 2 workers). Databricks environment: ML 6.6, scala 2.11, Spark…
user3448011
2
votes
1 answer

How to resume from a checkpoint when using Horovod with tf.keras?

Note: I'm using TF 2.1.0 and the tf.keras API. I've experienced the below issue with all Horovod versions between 0.18 and 0.19.2. Are we supposed to call hvd.load_model() on all ranks when resuming from a tf.keras h5 checkpoint, or are we only…
user1414202
2
votes
0 answers

How to run Tensorflow - Spark jobs on Kubernetes using the Spark Operator?

My team is looking for a way to run Spark jobs that use the Tensorflow library on Kubernetes. We use the Spark Operator to run Spark on Kubernetes idiomatically. How should I go about creating a pod with the Spark job (PySpark + TF) and have it work…
Ramya Raj
2
votes
1 answer

How to fix : horovod.run.common.util.network.NoValidAddressesFound

I'm trying to set up distributed learning with 2 nvidia-docker containers. When I tried with 2 hosts, it did not work. How do I fix this problem? I tried this command: `horovodrun -np 3 -H localhost:1 -p 12345 python keras_mnist_advanced.py` It worked, but when I…
plusultra
1
vote
0 answers

Horovod torch estimator prepare_batch error

I'm trying to build a Horovod torch estimator for a Spark pipeline, but I get an error when fitting the data and I don't understand the cause. I've left the full stack trace here, but the final trace is as…
1
vote
0 answers

Horovod Unable to use 2nd GPU worker

I have set up Horovod on a k8s cluster with 2 GPU nodes using spark-operator. I ran the MNIST example (https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/tutorials/training/spark/) using TensorFlow, and it executed successfully on both…
1
vote
0 answers

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

I am trying to use multiple GPUs with Horovod for distributed training. Initially, I tested a simple convolutional neural network on a single GPU and on two GPUs, and everything worked properly. Then, I used a CNN and an LSTM in combination. It…
Ahmad
1
vote
1 answer

Building Azure Machine Learning environment (tensorflow) from dockerfile failing

I'm trying to create a new environment based on the TF 2.4 curated environment, with OpenCV support as the only difference. I modified the dockerfile to include opencv as follows: FROM…