I am using Q-learning to determine the optimal path of an agent. I know in advance that every episode consists of exactly 3 states (i.e. after 3 states the agent reaches a terminal state). How do I incorporate this into the update rule of the Q-function?
What I am doing currently:
for t = 1:Nb_Epochs-1
    % Epsilon-greedy action selection
    if rand(1) < Epsilon
        a = randi(Nb_Actions);                        % explore: pick a random action
    else
        [~, a] = max(QIter(CurrentState, :, t));      % exploit: pick the greedy action
    end
    NextState = FindNextState(CurrentState, a);
    % Q-learning update
    QIter(CurrentState, a, t+1) = (1 - LearningRate) * QIter(CurrentState, a, t) ...
        + LearningRate * (Reward(NextState) + Discount * max(QIter(NextState, :, t)));
    CurrentState = NextState;
end
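For context, the standard way to treat a terminal state in Q-learning is to drop the bootstrap term: when the next state is terminal, the target is just the reward, with no `Discount * max(Q(NextState, :))` contribution. A minimal sketch of that update rule (in Python for illustration; the function name `q_update` and the state/action sizes are made up here, not part of the code above):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, terminal, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update step.

    If s_next is terminal, the target is just the reward r;
    otherwise it bootstraps with gamma * max_a' Q(s_next, a').
    """
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Hypothetical usage: 4 states, 2 actions, episode of length 3.
Q = np.zeros((4, 2))
# Non-terminal transition: target bootstraps from the next state's Q-values.
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, terminal=False)
# Terminal transition (third step of the episode): no bootstrap term.
Q = q_update(Q, s=2, a=0, r=5.0, s_next=3, terminal=True)
```

In your loop this would mean checking whether `NextState` is the third state of the episode and, if so, omitting the `Discount * max(QIter(NextState, :, t))` term from the update.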