I am trying to write a deep Q-learning network for a problem in AI. I have a function predict() that takes an input of shape (None, 5) and produces a tensor of shape (None, 3). The 3 in (None, 3) corresponds to the q-value of each action that can be taken in each state. In the training step, I have to call predict() multiple times and use the result to compute the cost and train the model. I also have another data array available, called current_actions, which is a list containing the indices of the actions taken in each state during previous iterations.

What needs to happen is that current_states_outputs should be a tensor built from the output of predict() in which each row contains only one q-value (as opposed to the three from the output of predict()), and which q-value is selected should depend on the corresponding index in current_actions.
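To make the setup concrete, this is roughly the kind of graph I mean for predict() (the hidden layer size here is only a placeholder, not my actual network):

import tensorflow as tf

# Rough sketch of the shapes involved; the hidden size is arbitrary
X = tf.placeholder(tf.float32, shape=[None, 5])          # state input (self.X in my code)
hidden = tf.layers.dense(X, 32, activation=tf.nn.relu)
prediction = tf.layers.dense(hidden, 3)                  # one q-value per action (self.prediction in my code)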
For example, if the output of predict() is [[1,2,3],[4,5,6],[7,8,9]] and current_actions = [0,2,1], then current_states_outputs after the operation should be [1,6,8].
How do I do this?
I have tried the following:
# run predict() to get concrete q-values, then pick one q-value per row in NumPy
current_states_outputs = self.sess.run(self.prediction, feed_dict={self.X: current_states})
current_states_outputs = np.array([current_states_outputs[a][current_actions[a]] for a in range(len(current_actions))])
I basically ran the session on predict() and did the selection using normal Python methods. But because this severs the connection between the cost and the previous layers of the graph, no training can be done. So I need to do this operation while staying within TensorFlow, keeping everything as TensorFlow tensors. How can I manage this?