I am trying to implement experience replay in TensorFlow. The problem I am having is storing the outputs from the model's trials and then updating the gradient for all of them simultaneously. One approach I tried was to store the values returned by sess.run(model); however, those are plain NumPy values, not tensors, so as far as TensorFlow is concerned they cannot be used for gradient descent. I am currently trying to use tf.assign(), and the difficulty I am having is best shown by the following example.
import tensorflow as tf
import numpy as np
def get_model(input):
    return input
a = tf.Variable(0)
b = get_model(a)
d = tf.Variable(0)
for i in range(10):
    assign = tf.assign(a, tf.Variable(i))
    b = tf.Print(b, [assign], "print b: ")
    c = b
    d = tf.assign_add(d, c)
    e = d
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(e))
The issues I have with the above code are:

- It prints a different value on each run, which seems odd.
- It does not update correctly at each step of the for loop.

Part of why I am confused is that I understand you have to run the assign operation to update the prior reference; however, I just can't figure out how to do that correctly at each step of the for loop. If there is an easier way, I am open to suggestions. This example mirrors how I am currently trying to feed in an array of inputs and get a sum based on each prediction the model makes. If clarification on any of the above would help, I will be more than happy to provide it.
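Just to spell out the computation I expect the graph above to perform, here is the plain-Python equivalent (using the same identity "model"). This is only an illustration of the intended behavior, not working TensorFlow code:

```python
# Plain-Python version of the computation the graph is meant to express.
# The "model" is the identity function, matching get_model above.
def get_model(x):
    return x

total = 0
for i in range(10):
    a = i                # what tf.assign(a, i) is meant to do each step
    b = get_model(a)     # the model's output for this step
    print("print b: [%d]" % b)
    total += b           # what tf.assign_add(d, b) is meant to accumulate

print(total)             # expected: 45
```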
The following are the results from running the code above three times.
$ python test3.py
2018-07-03 13:35:08.380077: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
print b: [8]
80
$ python test3.py
2018-07-03 13:35:14.055827: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [7]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
print b: [6]
60
$ python test3.py
2018-07-03 13:35:20.120661: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
print b: [9]
90
The result I am expecting is as follows:
print b: [0]
print b: [1]
print b: [2]
print b: [3]
print b: [4]
print b: [5]
print b: [6]
print b: [7]
print b: [8]
print b: [9]
45
The main reason I am confused is that sometimes it prints all nines, which makes me think it loads the last assigned value 10 times; however, sometimes it prints different values, which seems to contradict that theory.
What I would like to do is feed in an array of input examples and compute the gradient for all examples at the same time. It needs to happen concurrently because the reward used depends on the outputs of the model, so if the model changes mid-computation, the resulting rewards would also change.
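To make concrete what I mean by "compute the gradient for all examples at the same time", here is a NumPy sketch. The linear model, the reward function, and all the names here (w, x, rewards) are hypothetical, purely to illustrate the batched computation I am after:

```python
import numpy as np

# Hypothetical linear "model": prediction = w * x
w = 0.5
x = np.array([1.0, 2.0, 3.0, 4.0])   # a batch of input examples

outputs = w * x                      # all predictions from the SAME model snapshot
rewards = outputs ** 2               # the reward depends on the outputs themselves

# Loss over the whole batch at once. Because outputs/rewards are computed
# before the update, changing w afterwards cannot alter the rewards
# mid-computation. Toy loss: loss = -sum((w*x)^3), illustrative only.
loss = -np.sum(rewards * outputs)
grad = -np.sum(3 * (w * x) ** 2 * x)  # analytic d(loss)/dw for this toy loss

w = w - 0.01 * grad                   # a single, simultaneous update
```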