Questions tagged [distributed-tensorflow]

Use TensorFlow on multiple machines/devices.

Distributed TensorFlow is a set of techniques that allow the TensorFlow library to utilize multiple machines and/or devices, simultaneously or sequentially. It can be used to build, train, and deploy ML models, build ETL pipelines, or perform arbitrary computations. It covers the tf.distribute.Strategy API as well as older methods.

21 questions
7 votes · 2 answers

TensorFlow MirroredStrategy and Horovod distribution strategy

I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy. From the documentation and from source-code investigation, I found that Horovod (https://github.com/horovod/horovod) is using…
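
For context, the two APIs differ mainly in how processes are launched: MirroredStrategy is a single process driving all local GPUs, while Horovod runs one process per GPU under horovodrun/mpirun. A minimal sketch of each (layer sizes and learning rate are illustrative, not from the question):

    import tensorflow as tf

    # tf.distribute.MirroredStrategy: one process, in-graph replication,
    # gradients all-reduced across the local GPUs (NCCL by default).
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="sgd", loss="mse")

    # Horovod: one process per GPU (launched with horovodrun), gradients
    # averaged with ring-allreduce via the wrapped optimizer.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.SGD(0.01 * hvd.size()))
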
2 votes · 1 answer

TensorFlow: how to manually shard a dataset

I'm using MirroredStrategy to perform multi-GPU training, and it doesn't appear to be sharding the data properly. How do you go about manually sharding data? I know that I could use the shard method of a tf.data dataset, but for that I need…
Luke • 6,699
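
One common answer, sketched below: call tf.data.Dataset.shard with the worker count and index (hard-coded here for illustration; normally they come from TF_CONFIG), and switch off auto-sharding so the two mechanisms don't overlap:

    import tensorflow as tf

    num_workers, worker_index = 2, 0  # placeholders per worker

    dataset = tf.data.Dataset.from_tensor_slices(tf.range(100))
    # Each worker keeps every num_workers-th element, offset by its index.
    dataset = dataset.shard(num_shards=num_workers, index=worker_index)
    dataset = dataset.batch(32)

    # Disable tf.distribute's automatic sharding so the manual shard
    # above is the only one applied.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.OFF)
    dataset = dataset.with_options(options)
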
2 votes · 0 answers

TensorFlow CentralStorageStrategy

The tf.distribute.experimental.CentralStorageStrategy documentation specifies that variables are not mirrored; instead, they are placed on the CPU and ops are replicated across all GPUs. If I have a really big model that does not fit on any single GPU, could this be a…
Jack Shi • 23
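
For reference, the strategy in question takes one line to enable; a minimal sketch (note it keeps variables on the CPU but does not, by itself, split a model that exceeds single-GPU memory):

    import tensorflow as tf

    # Variables live on the CPU (or the single GPU, if only one exists);
    # each step's ops run replicated on all visible GPUs.
    strategy = tf.distribute.experimental.CentralStorageStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="sgd", loss="mse")
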
1 vote · 0 answers

How to build a TensorFlow cluster where each node can connect to any of the other nodes (1 to N-1)?

How can I build a TensorFlow cluster in which each node can make a connection to any of the other N-1 nodes? I checked the code, and the implementation is server-client with gRPC. Does that mean I should build a server and a client on each node so that…
skytree • 1,060
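
The usual pattern is to start one gRPC server per node, with every node holding the full cluster spec, so a channel to any of the other N-1 nodes can be opened on demand. A sketch with placeholder hostnames:

    import tensorflow as tf

    # Every node runs this with its own task_index; the ClusterSpec lists
    # all nodes, so any task can dial any other over gRPC.
    cluster = tf.train.ClusterSpec({
        "worker": ["node0:2222", "node1:2222", "node2:2222"],
    })
    server = tf.distribute.Server(cluster, job_name="worker", task_index=0)
    server.join()  # block and serve
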
1 vote · 1 answer

How to test distributed layers in TensorFlow?

I am trying to test a layer that I will later add to a distributed model, but I want to be sure that it works first. This is the layer in question: class BNShuffler(tf.Module): def __init__(self, global_batch_size: int=64 …
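
A common way to unit-test such a layer without multi-GPU hardware is to split the CPU into logical devices and drive the layer through strategy.run; a sketch (the doubling function below is a stand-in for the BNShuffler above):

    import tensorflow as tf

    # Split one physical CPU into two logical devices so MirroredStrategy
    # can run two replicas on a single machine.
    cpu = tf.config.list_physical_devices("CPU")[0]
    tf.config.set_logical_device_configuration(
        cpu, [tf.config.LogicalDeviceConfiguration()] * 2)

    strategy = tf.distribute.MirroredStrategy(["/cpu:0", "/cpu:1"])

    @tf.function
    def step(x):
        return x * 2.0  # stand-in for the layer under test

    per_replica = strategy.run(step, args=(tf.constant(1.0),))
    print(strategy.experimental_local_results(per_replica))
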
1 vote · 0 answers

How to broadcast with distributed TensorFlow

I want to broadcast some values from the chief to all workers with distributed TensorFlow, like MPI's bcast: https://mpi4py.readthedocs.io/en/stable/tutorial.html#collective-communication I guess broadcast_send or tf.raw_ops.CollectiveBcastSend…
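
Those raw ops do exist; the sketch below shows how the send/recv pair is usually wired up. The group and instance keys are arbitrary tags that must match on every task, and each branch runs on a different worker (running both in one process would block):

    import tensorflow as tf

    value = tf.constant([1.0, 2.0, 3.0])

    # On the chief (e.g. task 0): post the broadcast send.
    sent = tf.raw_ops.CollectiveBcastSend(
        input=value, group_size=2, group_key=1, instance_key=1,
        shape=value.shape)

    # On every other worker: post the matching recv with the same keys.
    received = tf.raw_ops.CollectiveBcastRecv(
        T=tf.float32, group_size=2, group_key=1, instance_key=1,
        shape=[3])
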
1 vote · 0 answers

Some questions about grpc+gdr and grpc+verbs when using distributed TensorFlow

When I use distributed TensorFlow, grpc+gdr performs worse than grpc+verbs, even though nv_peer_mem is loaded, and I don't know the difference between grpc+verbs and grpc+gdr. Can anyone help me? Some output is below: root@s36-2288H-V5:~# /etc/init.d/nv_peer_mem…
der Liu • 11
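
For reference, the transport is selected by the protocol argument when the server is created: grpc+verbs moves tensor payloads over RDMA, while grpc+gdr additionally targets GPU memory directly (GPUDirect, hence the nv_peer_mem module). A sketch, assuming a TensorFlow build with those transports compiled in and placeholder hostnames:

    import tensorflow as tf

    cluster = tf.train.ClusterSpec({"worker": ["node0:2222", "node1:2222"]})

    # protocol is one of "grpc", "grpc+verbs", "grpc+gdr", ...
    server = tf.distribute.Server(
        cluster, job_name="worker", task_index=0, protocol="grpc+verbs")
    server.join()
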
1 vote · 0 answers

Simple way to use a single GPU over IP in TensorFlow

I have been searching the web up and down but can't seem to find a simple answer. Basically, I have a desktop with one GPU and a laptop where my main code is. My goal is to use distributed TensorFlow to execute Python code on my laptop while…
Binary • 451
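
One minimal recipe (all addresses below are placeholders): run a bare tf.distribute.Server on the desktop, then connect the laptop's eager context to it and place ops on the remote GPU:

    import tensorflow as tf

    # On the desktop with the GPU:
    #   tf.distribute.Server(
    #       tf.train.ClusterSpec({"worker": ["0.0.0.0:2222"]}),
    #       job_name="worker", task_index=0).join()

    # On the laptop:
    tf.config.experimental_connect_to_cluster(
        tf.train.ClusterSpec({"worker": ["192.168.1.50:2222"]}))
    with tf.device("/job:worker/task:0/device:GPU:0"):
        y = tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))  # runs remotely
    print(y)  # result is fetched back to the laptop
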
1 vote · 1 answer

Distributed Keras MultiWorkerMirroredStrategy doesn't work with an embedding_column converted from a variable-length input feature

I am trying TensorFlow 2.0 and testing the distributed solution for Keras, but I've hit a problem: an embedding_column converted from a variable-length input feature doesn't work with the distributed Keras MultiWorkerMirroredStrategy. With local…
FelixHo • 1,254
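
For anyone reproducing this, the setup looks roughly as follows (the addresses in TF_CONFIG and the feature name "tags" are placeholders; the feature-column pair is the combination reported to fail):

    import json, os
    import tensorflow as tf

    # TF_CONFIG tells MultiWorkerMirroredStrategy who the workers are.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["host0:2222", "host1:2222"]},
        "task": {"type": "worker", "index": 0},
    })
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    # Variable-length categorical feature -> embedding_column.
    cat = tf.feature_column.categorical_column_with_hash_bucket(
        "tags", hash_bucket_size=100)
    emb = tf.feature_column.embedding_column(cat, dimension=8)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.DenseFeatures([emb]),
            tf.keras.layers.Dense(1),
        ])
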
1 vote · 0 answers

Is TLS supported in distributed TensorFlow gRPC communication?

I was wondering whether TLS is supported in current distributed TensorFlow with gRPC. I am reading through the code, https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h#L105 and the implementation of…
JRH • 53
1 vote · 0 answers

How to create a custom distribution strategy in TensorFlow

I'm looking to write a custom distribution strategy for TensorFlow. Currently there are several types of strategies available (MirroredStrategy, TPUStrategy, ...), but I would like to implement a new way of distributing training across several…
Joao P • 38
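
There is no publicly documented extension point for this; the built-in strategies pair a tf.distribute.Strategy subclass with a StrategyExtended object that does the actual variable placement and replication work. A bare-bones sketch using internal APIs (these may change between TF versions):

    import tensorflow as tf
    from tensorflow.python.distribute import distribute_lib

    class MyExtended(distribute_lib.StrategyExtendedV2):
        """Would implement _create_variable, _call_for_each_replica, etc."""
        def __init__(self, container_strategy):
            super().__init__(container_strategy)

    class MyStrategy(tf.distribute.Strategy):
        def __init__(self):
            super().__init__(MyExtended(self))
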
1 vote · 1 answer

Implementing Mask R-CNN with distributed TensorFlow

I'm training a Mask R-CNN network, which is built on TensorFlow and Keras. I'm searching for a way to reduce training time, so I thought of implementing it with distributed TensorFlow. I've been working with Mask R-CNN for some time, but it seems what…
1 vote · 0 answers

Distributed TensorFlow error: Check failed: DeviceNameUtils::ParseFullName(new_base, &parsed_name)

Trying to run a distributed TensorFlow example on CPU from: https://github.com/tmulc18/Distributed-TensorFlow-Guide/blob/master/Distributed-Setup/dist_setup.py Commands to run the example can be found…
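
The linked guide uses the classic TF 1.x ps/worker layout; that check typically fires when a device string such as "/job:ps/task:0" is malformed or names a job missing from the ClusterSpec. The canonical setup, sketched with compat APIs and placeholder ports:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    cluster = tf.train.ClusterSpec({
        "ps": ["localhost:2222"],
        "worker": ["localhost:2223"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # replica_device_setter pins variables to the ps task and ops to the
    # worker; a typo in these device strings triggers the ParseFullName check.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        v = tf.get_variable("v", shape=[])
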
1 vote · 0 answers

How to run multiprocessing Python with distributed TensorFlow on Slurm

I want to run a multiprocessing distributed TensorFlow program on Slurm. The script should use the Python multiprocessing library to open different sessions on different nodes in parallel. This approach works when testing with Slurm interactive…
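
A sketch of the pattern (hostnames are hard-coded here; under Slurm they would be derived from SLURM_JOB_NODELIST): each child process starts and serves one task of the cluster.

    import multiprocessing as mp
    import tensorflow as tf

    def run_task(job_name, task_index, cluster_def):
        # Each subprocess serves one task of the cluster.
        cluster = tf.train.ClusterSpec(cluster_def)
        tf.distribute.Server(
            cluster, job_name=job_name, task_index=task_index).join()

    if __name__ == "__main__":
        cluster_def = {"worker": ["localhost:2222", "localhost:2223"]}
        procs = [mp.Process(target=run_task, args=("worker", i, cluster_def))
                 for i in range(2)]
        for p in procs:
            p.start()
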
1 vote · 0 answers

Distributed execution under eager mode using TensorFlow

According to a recently published white paper and the RFC on GitHub, TensorFlow eager currently supports distributed execution. It is mentioned that, similar to graph mode, we can run an operation eagerly on a remote device by setting the device…
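
As described there, the client attaches its eager context to a remote server and then simply sets a device string; a sketch with a placeholder address:

    import tensorflow as tf

    # Attach the eager context to a remote worker (address is a placeholder).
    tf.config.experimental_connect_to_host(
        "192.168.1.50:2222", job_name="worker")

    with tf.device("/job:worker/replica:0/task:0/device:CPU:0"):
        x = tf.square(tf.constant(3.0))  # executes remotely, eagerly
    print(x)  # value is fetched back to the local client
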