I am trying to implement Q-learning in an environment where the rewards R are stochastic, time-dependent variables that arrive in real time, after a constant time interval deltaT. The states S (scalars) also arrive after the same constant interval deltaT. The agent's task is to choose the optimal action after it receives (S(n*deltaT), R(n*deltaT)).
My problem is that I am very new to RL, and I don't understand how this algorithm should be implemented; most papers describing Q-learning are written in "scientific English", which is not helping me.
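For reference, the update rule I am trying to implement is, as I understand it, the standard Q-learning update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor (alpha and gamma in the code below).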
OnTimer() executes after a fixed interval:
double alpha = 0.95;                 // learning rate
double gamma = 0.95;                 // discount factor
double old_state = 0;
action new_action = null;
action old_action = random_action;

void OnTimer()
{
    double new_state = environment.GetNewState();

    // Find the greedy action and the max Q-value in the new state.
    // Qmax starts at -infinity so that negative Q-values can also win.
    double Qmax = double.NegativeInfinity;
    foreach (action a in Actions)
    {
        if (Q(new_state, a) > Qmax)
        {
            Qmax = Q(new_state, a);
            new_action = a;          // only update the greedy action when Q improves
        }
    }

    // Reward observed for the transition taken from the previous state
    double reward = environment.Reward(old_state, old_action);

    // Q-learning update for the previous state-action pair
    Q(old_state, old_action) = Q(old_state, old_action)
        + alpha * (reward + gamma * Qmax - Q(old_state, old_action));

    old_state = new_state;
    old_action = new_action;
    agent.ExecuteInEnvironment(new_action);
}
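One thing I suspect is missing: I always pick the greedy action, so the agent never explores. A minimal epsilon-greedy selection I could call from OnTimer() might look like the sketch below (epsilon, rand, and treating Actions as a List&lt;action&gt; are my own assumptions, not part of my real code):

double epsilon = 0.1;                // exploration probability (assumed value)
Random rand = new Random();

action SelectAction(double state)
{
    // With probability epsilon, explore: pick a uniformly random action.
    if (rand.NextDouble() < epsilon)
        return Actions[rand.Next(Actions.Count)];

    // Otherwise exploit: pick the action with the highest Q-value.
    action best = Actions[0];
    foreach (action a in Actions)
    {
        if (Q(state, a) > Q(state, best))
            best = a;
    }
    return best;
}

Is the lack of exploration something that could prevent convergence here?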
Question:
Is this a proper implementation of online Q-learning? It does not seem to work. Why does it not converge to optimal behavior as n*deltaT -> inf? Please help, it is very important.
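In case it matters: my states are continuous scalars and I index Q by the raw double. I assume tabular Q-learning needs discrete states, so perhaps I should bin the state first, something like this (binCount and the state bounds are made-up values just to illustrate):

int binCount = 100;                      // number of discrete bins (assumed)
double stateMin = 0.0, stateMax = 1.0;   // assumed bounds of my scalar state

int Discretize(double state)
{
    // Clamp into range, then map onto a bin index in 0..binCount-1
    double clamped = Math.Min(Math.Max(state, stateMin), stateMax);
    return (int)((clamped - stateMin) / (stateMax - stateMin) * (binCount - 1));
}

Would I then key the Q table on Discretize(new_state) instead of new_state itself?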