
In the Q-learning algorithm, there is a reward function that rewards the action taken in the current state. My question is: can I have a non-deterministic reward function that depends on the time at which an action is performed in a state?

For example, suppose the reward for an action taken in a state at 1PM is r(s, a). After several iterations (say it is now 3PM), the system reaches the same state and performs the same action as it did at 1PM. Must the reward given at 3PM be the same as the one given at 1PM? Or can the reward function be designed with time taken into consideration (i.e., the reward for the same state and the same action at different times can be different)?

That is the question I want to ask. One more thing I should add: I don't want to treat time as a feature of the state, because then no two states could ever be the same (time is always increasing).
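
To make this concrete, here is a minimal sketch (in Python, with made-up names and an arbitrary time window) of the kind of time-dependent reward I have in mind:

```python
import datetime

def base_reward(state, action):
    """Placeholder for the usual time-independent r(s, a)."""
    return 1.0 if action == "good_action" else 0.0

def reward(state, action, timestamp):
    """Hypothetical reward that also depends on the wall-clock time of the step."""
    hour = timestamp.hour  # e.g. 13 at 1PM, 15 at 3PM
    # The same (state, action) pair earns an extra bonus only around 1PM.
    time_bonus = 1.0 if 12 <= hour < 14 else 0.0
    return base_reward(state, action) + time_bonus

# Same state, same action, different times -> different rewards:
r_1pm = reward("s0", "good_action", datetime.datetime(2019, 8, 27, 13, 0))  # 2.0
r_3pm = reward("s0", "good_action", datetime.datetime(2019, 8, 27, 15, 0))  # 1.0
```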

Richard Hu

1 Answer


My first thought was your last sentence, i.e., to include the time as part of the state. As you said, time is always increasing, but it is also cyclical. So maybe your reward function could depend on some repetitive feature of time; for example, every day it is 3PM at some point.
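
As a minimal sketch of that idea (the sine/cosine encoding below is just one common way to represent a cyclical feature, not something specific to Q-learning):

```python
import math
import datetime

def time_of_day_features(timestamp):
    """Encode the hour of day as a point on a circle, so 23:00 and 01:00 end up close."""
    hours = timestamp.hour + timestamp.minute / 60.0
    angle = 2.0 * math.pi * hours / 24.0
    return (math.sin(angle), math.cos(angle))

# 3PM maps to the same feature vector every day, so a reward (or state) built on
# these features repeats daily instead of growing without bound like raw time.
today = time_of_day_features(datetime.datetime(2019, 8, 27, 15, 0))
tomorrow = time_of_day_features(datetime.datetime(2019, 8, 28, 15, 0))
assert today == tomorrow
```

With such features the agent sees what time of day it is rather than how much time has passed, so identical situations can recur.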

On the other hand, the reward function can be stochastic; there is no restriction to deterministic functions. However, take into account that the policy will tend to optimize the expected return. Therefore, if your agent obtains a totally different reward each time it visits the same [state, action] pair, there is probably something wrong with the way you are modelling your environment.
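
For instance, a tiny tabular sketch (with a made-up noisy reward; none of this comes from a library) shows that the learned Q-value settles around the expected reward rather than around any individual sample:

```python
import random

alpha = 0.1                  # learning rate
Q = {("s", "a"): 0.0}        # a single state-action pair is enough for the point

def stochastic_reward(state, action):
    """Noisy reward: a different sample on every visit, but with mean 1.0."""
    return random.gauss(1.0, 0.5)

for _ in range(10_000):
    r = stochastic_reward("s", "a")
    # One-step Q-learning update; the next state is terminal, so the bootstrap term is 0.
    Q[("s", "a")] += alpha * (r - Q[("s", "a")])

print(Q[("s", "a")])  # hovers around the expected reward (1.0), not around any single sample
```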

Pablo EM
  • Thanks for your answer! I think you are right and I should consider time as part of the state space. Actually, a more precise way to describe the problem I am working on is that the reward function should consider not only the current state but also the history of previous states. That means when I reach the same state at a different time, the reward could be different because the preceding state transitions could be different. So may I ask: are there any RL algorithms that take history into consideration? – Richard Hu Aug 27 '19 at 08:02
  • Glad to be of help! Regarding your question, you have reached an important conclusion. In fact, in the Sutton & Barto RL book (Section 3.1) you can read: "In a Markov decision process, the probabilities given by `p` (where `p` defines the environment dynamics) completely characterize the environment’s dynamics. That is, the probability of each possible value for S_t and R_t depends only on the immediately preceding state and action, S_{t-1} and A_{t-1}, and, given them, not at all on earlier states and actions". That is, in an ideal situation your state should contain all the relevant information about previous states (a minimal sketch of this idea appears after these comments). – Pablo EM Aug 27 '19 at 09:01
  • So, I recommend you read Section 3.1 of that book (it's free, here: http://incompleteideas.net/book/RLbook2018.pdf), understand what the **Markov property** is, why it is important for the theoretical convergence properties of RL algorithms, and what methods you can use when your environment violates the Markov property (Chapter 17 of the same book). – Pablo EM Aug 27 '19 at 09:06
  • It's surprising to see an answer like this! Following your suggestion, it seems Section 17.3 is what I need. Thanks again for your kind help! – Richard Hu Aug 27 '19 at 14:46
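
Regarding the history question discussed in the comments above, here is a minimal sketch (pure illustration; the window size and helper names are made up) of folding a short window of recent observations into the state, so that the augmented state carries the history the reward depends on:

```python
from collections import deque

HISTORY_LEN = 3                      # illustrative window size
history = deque(maxlen=HISTORY_LEN)  # keeps only the most recent observations

def observe(observation):
    """Append the newest observation and return a history-augmented state."""
    history.append(observation)
    return tuple(history)

# The "same" observation reached via different pasts becomes a different state,
# so a history-dependent reward no longer looks non-deterministic to the agent.
s_now = [observe(o) for o in ("A", "B", "C")][-1]   # ('A', 'B', 'C')
history.clear()
t_now = [observe(o) for o in ("X", "B", "C")][-1]   # ('X', 'B', 'C')
assert s_now != t_now
```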