
I am trying to implement Q-learning in an environment where the rewards R are stochastic, time-dependent variables that arrive in real time at a constant interval deltaT. The states S (scalars) also arrive at the constant interval deltaT. The agent's task is to output the optimal action after it receives (S(n*deltaT), R(n*deltaT)).

My problem is that I am very new to RL and I don't understand how this algorithm should be implemented; most papers describing the Q-learning algorithm are written in "scientific English", which is not helping me.

OnTimer() executes after a fixed interval:

double a = 0.95;
double g = 0.95;

double old_state = 0;
action new_action = null;
action old_action = random_action;

void OnTimer()
{
   double new_state = environment.GetNewState();
   double Qmax = 0;

   foreach(action a in Actions)
   {
      if(Q(new_state, a) > Qmax)
         Qmax = Q(new_state, a);
      new_action = a;
   }

   double reward = environment.Reward(old_state, old_action);

   Q(old_state, old_action) = Q(old_state, old_action) + a*(reward + g*Qmax - Q(old_state, old_action));

   old_state = new_state;
   old_action = new_action;

   agent.ExecuteInEnvironment(new_action);
}

Question:

Is this a proper implementation of online Q-learning? It does not seem to work. Why does it not behave optimally as n*deltaT -> inf? Please help, it is very important.

user2981093

1 Answer


It's hard to say exactly what's going wrong without more information, but it doesn't look like you've implemented the algorithm correctly. Generally, the algorithm is as follows (a code sketch follows the list):

  1. Start out in an initial state as the current state.
  2. Select the next action from the current state using a learning policy (such as epsilon-greedy). The policy picks the action that causes the transition from the current state to the next state.
  3. The (current state, action) pair will tell you what the next state is.
  4. Find Qmax (which I think you're doing correctly). One exception might be that Qmax should be 0 if the next state is a terminal state, but you might not have one.
  5. Get the reward for the (current state, action, next state) tuple. You seem to be ignoring the transition to the next state in your calculation.
  6. Update Q value for (old state, old action). I think you're doing this correctly.
  7. Set current state to next state
  8. Return to step 2, unless the current state is terminal.
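
Putting those steps together, here is a minimal sketch of one tabular Q-learning iteration in C#. It assumes a discrete (integer) state and action space with a Dictionary-based Q-table; your post uses a scalar double state, which would first need to be discretized (or replaced with function approximation). All names here (QLearningAgent, SelectAction, Step, the env delegate) are illustrative and not part of your original code.

using System;
using System.Collections.Generic;
using System.Linq;

class QLearningAgent
{
    const double Alpha = 0.1;    // learning rate
    const double Gamma = 0.95;   // discount factor
    const double Epsilon = 0.1;  // exploration probability

    readonly Random rng = new Random();

    // Q-table keyed by (state, action); unseen pairs default to 0.
    readonly Dictionary<(int state, int action), double> q =
        new Dictionary<(int state, int action), double>();

    readonly int[] actions = { 0, 1, 2 };   // example discrete action set

    double Q(int s, int a) => q.TryGetValue((s, a), out var v) ? v : 0.0;

    // Epsilon-greedy learning policy (step 2): explore with probability
    // Epsilon, otherwise take the action with the highest current Q-value.
    int SelectAction(int s) =>
        rng.NextDouble() < Epsilon
            ? actions[rng.Next(actions.Length)]
            : actions.OrderByDescending(a => Q(s, a)).First();

    // One full iteration of steps 2-7. The env delegate stands in for the
    // real environment: given (state, action) it returns the observed
    // reward, the next state, and whether that next state is terminal.
    public int Step(int state, Func<int, int, (double reward, int next, bool terminal)> env)
    {
        int action = SelectAction(state);
        var (reward, next, terminal) = env(state, action);

        // Step 4: Qmax over the next state's actions; 0 if the next state
        // is terminal.
        double qMax = terminal ? 0.0 : actions.Max(a => Q(next, a));

        // Step 6: Q-learning update for the (state, action) pair.
        q[(state, action)] = Q(state, action)
                           + Alpha * (reward + Gamma * qMax - Q(state, action));

        // Step 7: the caller continues from the next state.
        return next;
    }
}

In your OnTimer() setting, you would call something like Step once per tick, feeding it the reward and new state that just arrived, and keep the returned state as the current state for the next tick.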

Do you know the probability of your selected action actually causing your agent to move to the intended state, or is that something you have to estimate by observation? If states are just arriving arbitrarily and you don't have any control over what happens, this might not be an appropriate environment in which to apply reinforcement learning.

bpt3