In the Q-learning algorithm, there is a reward function that rewards the action taken in the current state. My question is: can I have a non-deterministic reward function, i.e., one whose value depends on the time at which an action is performed in a state?
For example, suppose the reward for an action taken in a state at 1PM is r(s,a). After several iterations (say it is now 3PM), the system reaches the same state and performs the same action as it did at 1PM. Must the reward given at 3PM be the same as the one given at 1PM? Or can the reward function take time into account (i.e., can the reward for the same state and the same action differ at different times)?
That is the question I want to ask. One more thing: I don't want to treat time as part of the state, because in that case no two states could ever be the same (time is always increasing).
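To make the setup concrete, here is a minimal sketch of what I mean. All names, state/action encodings, and the time bonus are invented for illustration: the reward function takes an extra time argument t (an hour of day), while the Q-learning update itself stays standard and tabular.

```python
def reward(state, action, t):
    """Hypothetical time-dependent reward: the same (state, action)
    pair pays differently depending on the hour t at which it is taken."""
    base = 1.0 if (state, action) == (0, 1) else 0.0
    time_bonus = 0.1 * (t % 24)  # invented: reward grows with the hour
    return base + time_bonus

def q_update(Q, s, a, s_next, t, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update; only the reward call sees t."""
    r = reward(s, a, t)
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return r

# Same state and action, but at 1PM (t=13) vs 3PM (t=15):
# reward(0, 1, 13) and reward(0, 1, 15) return different values.
```

Note that here time is only an input to the reward function, not a component of the state, which is exactly the distinction I am asking about.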