I'm trying to understand parallelism on the GPU in TensorFlow, since I need to apply it to uglier graphs.
import tensorflow as tf
from datetime import datetime

with tf.device('/device:GPU:0'):
    # 100000-element vector of ones, updated in place by the loop body
    var = tf.Variable(tf.ones([100000], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=1000)  # tweak

@tf.function
def b(i):
    # zero out element i of var, then advance the loop counter
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(i, [-1, 1]),
                                           tf.constant([0], dtype=tf.dtypes.float32)))
    return tf.add(i, 1)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i, 100000)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today() - start)
In the code above, var is a tensor of length 100000 whose elements are set to 0, one per iteration, by the loop body. When I change parallel_iterations across 10, 100, 1000, and 10000, there is hardly any difference in run time (all around 9.8 s), even though I am explicitly setting the parallel_iterations argument.
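To rule out tf.function tracing time dominating the 9.8 s, I assume a fairer measurement would warm the function up once and time only later calls; a minimal sketch of what I mean (timeit here is just one way to measure, and it assumes the code above has already run):

    import timeit

    # Warm-up call so graph tracing/compilation is not counted in the timing.
    foo()
    var.assign(tf.ones([100000], dtype=tf.dtypes.float32))  # reset state between runs

    # Time only graph execution, averaged over a few calls.
    elapsed = timeit.timeit(lambda: foo(), number=5) / 5
    print(f"average per call: {elapsed:.3f} s")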
I want these updates to happen in parallel on the GPU. How can I implement that?
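For context, I realize this toy loop collapses into a single vectorized assignment (sketch below, reusing var from above), but the graphs I actually need this for are uglier, so I'm asking specifically about making the while_loop iterations themselves run in parallel:

    # Equivalent one-shot update for this toy example, with no loop at all:
    with tf.device('/device:GPU:0'):
        var.assign(tf.zeros([100000], dtype=tf.dtypes.float32))

    # or, staying closer to the scatter formulation:
    with tf.device('/device:GPU:0'):
        idx = tf.reshape(tf.range(100000), [-1, 1])
        var.assign(tf.tensor_scatter_nd_update(var, idx, tf.zeros([100000], dtype=tf.dtypes.float32)))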