I was trying to implement a Q-learning algorithm in Keras. In the articles I read, I found these lines of code:
    for state, action, reward, next_state, done in sample_batch:
        target = reward
        if not done:
            # formula
            target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
        target_f = self.brain.predict(state)
        # shape (1, 2)
        target_f[0][action] = target
        print(target_f.shape)
        self.brain.fit(state, target_f, epochs=1, verbose=0)
    if self.exploration_rate > self.exploration_min:
        self.exploration_rate *= self.exploration_decay
The variable sample_batch is an array of (state, action, reward, next_state, done) samples taken from the collected data.
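For context, here is a minimal sketch of how such a batch might be built. The names memory and sample_batch_size, the memory size, and the state size of 4 are assumptions for illustration; they are not in the code I found.

```python
import random
from collections import deque

import numpy as np

# Hypothetical replay memory; the name and maxlen are assumptions.
memory = deque(maxlen=2000)

rng = np.random.default_rng(0)
for step in range(100):
    # Fake transitions: states have shape (1, 4) so they could be passed to predict().
    state = rng.normal(size=(1, 4))
    action = int(rng.integers(0, 2))   # two possible actions, matching the (1, 2) output
    reward = 1.0
    next_state = rng.normal(size=(1, 4))
    done = step % 20 == 19             # pretend every 20th step ends an episode
    memory.append((state, action, reward, next_state, done))

# Draw a random minibatch of stored transitions.
sample_batch_size = 32
sample_batch = random.sample(memory, sample_batch_size)
```

Each element of sample_batch is then a 5-tuple that the for loop unpacks into state, action, reward, next_state and done.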
I also found the following Q-learning formula:
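Assuming it is the standard tabular Q-learning update, with learning rate \alpha and discount factor \gamma (both symbols are my assumption here), it can be written as:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$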
Why is there no minus sign in the code version of the equation? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]]. So does it play the role of the prediction in this equation? As I understand it, target_f is the predicted output for the current state, and we then set its entry for the chosen action to the target, which includes the reward for this step. Then we train the model on the current state (X) and target_f (Y).

I have a few questions:

What is the role of self.brain.predict(next_state), and why is there no minus sign?

Why do we predict twice with one model, e.g. self.brain.predict(state) and self.brain.predict(next_state)[0]?
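To make the question concrete, here is a minimal NumPy sketch of what I understand one loop iteration computes. Only the next-state prediction is the output quoted above; gamma, reward, action and the current-state prediction are made-up values, and the arrays stand in for the two predict calls.

```python
import numpy as np

gamma = 0.95   # assumed discount factor
reward = 1.0   # assumed reward for this step
action = 1     # assumed index of the action that was taken

# Stand-in for self.brain.predict(next_state): the output quoted in the question.
next_q = np.array([[-0.06427538, -0.34116858]])
# Made-up stand-in for self.brain.predict(state).
current_q = np.array([[0.10, 0.20]])

# Best Q-value the model currently predicts for the next state.
best_next = np.amax(next_q[0])

# The "target" line from the loop (the not-done case).
target = reward + gamma * best_next

# The difference that the tabular formula writes with a minus sign.
td_error = target - current_q[0][action]

# Copy the current prediction and overwrite only the chosen action's entry.
target_f = current_q.copy()
target_f[0][action] = target
```

Under this reading, target_f differs from the model's own prediction only in the entry for the chosen action, so the other output would contribute no error if fit uses a squared-error loss (an assumption; the compile call is not shown in the code I found).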