Good morning. In Q-learning, the agents take actions until reaching their goal, and the algorithm is executed many times until convergence is obtained. For example, suppose the goal is to obtain the maximum throughput until the end of the simulation time. The simulation time is divided into n equal periods T, and the reward varies over time, so the agents update their states n times, at the beginning of each period. In this case, is n considered the number of steps or the number of iterations? In addition, is the update of the Q-value done after executing the selected action, or before the execution (using the reward function, which is an approximation of the real reward)? I would be grateful if you could answer my questions.
1 Answer
Firstly, you should know that in reinforcement learning there exist two kinds of tasks: those in which the agent-environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and those in which it does not (continuing tasks) [Sutton book ref.].
The agent's goal is to maximize the total amount of reward it receives (in a simulation or in a real environment). This means maximizing not immediate reward, but cumulative reward in the long run.
In the case of an episodic task, each episode often has a different duration (e.g., if each episode is a chess game, each game usually finishes after a different number of moves).
The reward function doesn't change, but the reward received by the agent changes depending on the actions it takes. In the Q-learning algorithm, the agent updates the Q-function after each step (not at the beginning of each period/episode).
According to your definition, n is considered the number of steps per episode (which can vary from one episode to another, as previously stated). The total number of steps is the sum of n over all the episodes. The term 'iterations' may refer to the number of episodes in some papers/books, so it's necessary to know the context.
The update of the Q-function is performed after executing the selected action. Notice that the agent needs to execute the current action in order to observe the reward and the next state.
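To make the order of operations concrete, here is a minimal tabular Q-learning sketch in Python. The env.reset()/env.step() interface, the action list, and all parameter values are just assumptions for illustration, not part of your setting:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning sketch. `env` is a hypothetical object with
        reset() and step(action) -> (next_state, reward, done)."""
        Q = defaultdict(float)                    # Q[(state, action)], starts at 0
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:                       # n steps in this episode (can vary)
                # epsilon-greedy selection of the current action
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                # the action is executed first ...
                next_state, reward, done = env.step(action)
                # ... and only then is the Q-value updated, using the
                # observed reward and the observed next state
                best_next = max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q

Note that the Q-update happens inside the step loop, once per executed action, which is exactly the timing described above.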
The reward function is not an approximation of the real reward; there is no "real" reward. The reward function is designed by the user to 'tell' the agent what the goal is. More on this topic in the Sutton and Barto book, Section 3.2, Goals and Rewards.
-
Good evening, thank you very much for your explanation. I think that in my case it is not suitable to model the problem with episodic tasks (because the goal is to obtain the maximum throughput during the simulation), so the formulation with continuing tasks is more suitable. In this case, how do the agents perform actions, and is it possible to achieve convergence where no agent has an interest in changing its state? – student26 Dec 02 '16 at 21:51
-
Why would the agent have no interest in changing its value function? You should define a reward that reflects the agent's goal. For example, if your task is to operate a heating system (potentially a continuing task), the agent might receive a negative reward proportional to the consumption, so that the agent tries to minimize the consumption. – Pablo EM Dec 03 '16 at 07:31
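As a tiny sketch of that reward-design idea (the function name and unit are illustrative assumptions only):

    def reward(consumption_kwh):
        # Negative and proportional to the energy consumed in the last step,
        # so maximizing total reward means minimizing total consumption.
        return -consumption_kwh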
-
@student26, please let me know if I've answered your original question and, if necessary, feel free to open a new question about Q-learning implementation. Thx. – Pablo EM Dec 06 '16 at 13:21
-
Good evening, thank you for your help and sorry for the delayed response. My problem is that the goal is to maximize the total reward, so I couldn't define a terminal condition. – student26 Dec 06 '16 at 17:11
-
Ok, but this comment is probably not directly related to your original post, and you should open another question to keep things clean and useful for other users. You may obtain a more accurate response by giving more details about your setting (e.g., what is your environment? what is the agent's goal?). If you want to keep your problem private, perhaps you can imagine a "toy example" that actually represents your needs (the same advice also holds for the RL mailing list). In any case, I think you are misunderstanding some general ideas about RL. Generally, in RL the PART 1/2 – Pablo EM Dec 06 '16 at 17:44
-
PART 2/2 goal is always to maximize the long-term total reward (assuming an infinite-horizon setting, which is quite common but not the only option: some problems are better described by a finite horizon). If the task is continuing instead of episodic, the setting should be discounted infinite-horizon reward. In such a case, you should stop the algorithm once convergence is reached. Convergence can be measured in different ways, but as I stated previously, it would be useful to know more about your problem. – Pablo EM Dec 06 '16 at 17:44
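For example, one simple (illustrative) way to measure convergence with a tabular Q-function stored as a Python dict is to compare periodic snapshots; the tolerance below is an arbitrary placeholder:

    def has_converged(q_old, q_new, tol=1e-4):
        """Return True when no Q-value changed by more than `tol`
        between two snapshots keyed by (state, action)."""
        keys = set(q_old) | set(q_new)
        return all(abs(q_new.get(k, 0.0) - q_old.get(k, 0.0)) < tol for k in keys)

You would snapshot the Q-table every fixed number of steps and stop once several consecutive snapshots pass this test.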
-
Good afternoon, the problem is as follows: a set of agents have access to 2 access points (APs) to upload data. S = {1, 2} is the set of states, referring to the connection to AP1 or AP2. A = {remain, change}. We suppose that during the total duration of the simulation, the agents can access the 2 APs. The goal is to upload the maximum amount of data during the simulation. – student26 Dec 07 '16 at 16:12
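Just to illustrate how that description might map onto a tabular setup (names, the deterministic transition, and the reward choice are assumptions, not a definitive model):

    # States: which AP the agent is currently connected to.
    states = ["AP1", "AP2"]
    # Actions: stay on the current AP or switch to the other one.
    actions = ["remain", "change"]

    # Tabular Q-function, initialized to zero.
    Q = {(s, a): 0.0 for s in states for a in actions}

    def next_ap(state, action):
        """Deterministic transition: 'change' moves to the other AP."""
        if action == "remain":
            return state
        return "AP2" if state == "AP1" else "AP1"

    # A natural (hypothetical) reward for one period T: the amount of data
    # the agent actually uploaded through its current AP during that period,
    # as measured by the simulator.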
-
Hi again @student26, firstly, thanks for your explanation. However, as I stated in previous comments, you should open a new question and close this one (accepting the answer if you consider it useful for your original question). That way we follow Stack Overflow's rules, and you will probably have the chance to get help from other users in addition to my answers. – Pablo EM Dec 08 '16 at 13:21
-
Thank you for your reply. Ok, I will open a new question – student26 Dec 08 '16 at 13:31