I'm trying to understand parallelism on the GPU in TensorFlow, since I need to apply it to uglier graphs.
import tensorflow as tf
from datetime import datetime

with tf.device('/device:GPU:0'):
    # 100000-element vector of ones, updated in place by the loop body
    var = tf.Variable(tf.ones([100000], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=1000)  # tweak

@tf.function
def b(i):
    # zero out element i of var, then advance the loop counter
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(i, [-1, 1]),
                                           tf.constant([0], dtype=tf.dtypes.float32)))
    return tf.add(i, 1)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i, 100000)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today() - start)
In the code above, var is a tensor of length 100000 whose elements are set to 0, one per iteration, by the loop body. When I change parallel_iterations across 10, 100, 1000, and 10000, there is hardly any difference in run time (all around 9.8 s), even though I am explicitly setting the parallel_iterations argument.
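To rule out tf.function tracing time dominating the 9.8 s, I assume a fairer measurement would warm the function up once and time only later calls; a minimal sketch of what I mean (timeit here is just one way to measure, and it assumes the code above has already run):

    import timeit

    # Warm-up call so graph tracing/compilation is not counted in the timing.
    foo()
    var.assign(tf.ones([100000], dtype=tf.dtypes.float32))  # reset state between runs

    # Time only graph execution, averaged over a few calls.
    elapsed = timeit.timeit(lambda: foo(), number=5) / 5
    print(f"average per call: {elapsed:.3f} s")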
I want these updates to happen in parallel on the GPU. How can I implement that?
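For context, I realize this toy loop collapses into a single vectorized assignment (sketch below, reusing var from above), but the graphs I actually need this for are uglier, so I'm asking specifically about making the while_loop iterations themselves run in parallel:

    # Equivalent one-shot update for this toy example, with no loop at all:
    with tf.device('/device:GPU:0'):
        var.assign(tf.zeros([100000], dtype=tf.dtypes.float32))

    # or, staying closer to the scatter formulation:
    with tf.device('/device:GPU:0'):
        idx = tf.reshape(tf.range(100000), [-1, 1])
        var.assign(tf.tensor_scatter_nd_update(var, idx, tf.zeros([100000], dtype=tf.dtypes.float32)))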