
I am new to RL and I am referring to a couple of books and tutorials, yet I have a basic question and I hope to find a fundamental answer here.

The primary references: Sutton & Barto, 2nd edition, and a blog.

Problem description (Q-learning approach only): The agent has to travel from point A to point B along a straight line. Point B is static, and only the initial position of the agent is random. -----------A(60,0)----------------------------------B(100,0)------------->

To keep it simple, the agent always moves in the forward direction. B is always at x-position 100, which is also the goal state, and in the first iteration A is at x-position 60. So the actions are just "Go forward" and "Stop". The reward structure is: the agent receives 100 when it reaches point B, -500 when it crosses B, and 0 otherwise. So the goal for the agent is to reach and stop exactly at position B.
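
To make the reward rule concrete, here is a minimal MATLAB sketch of what I mean (the function and variable names are only illustrative):

    % Reward rule described above; positions are x-coordinates, the goal is at x = 100.
    function r = step_reward(agent_x, goal_x)
        if agent_x == goal_x
            r = 100;    % reached B exactly
        elseif agent_x > goal_x
            r = -500;   % crossed B
        else
            r = 0;      % still approaching B
        end
    end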

1) How many states would it require to go from point A to point B in this case, and how do I define a Q matrix and an R matrix for this? 2) How do I add a new column and row if a new state is found?

Any help would be greatly appreciated.

Q_matrix implementation:

    % Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    cur_idx  = find(List_Ego_pos_temp == current_state);   % row index of the current state
    next_idx = cur_idx + 1;                                 % index of the next stored position
    Q_matrix(cur_idx, possible_actions) = Q_matrix(cur_idx, possible_actions) + ...
        this.learning_rate * (Store_reward(this.Ego_pos_counter) + ...
        this.discount * max(Q_matrix(next_idx, :)) - Q_matrix(cur_idx, possible_actions));

This implementation is in MATLAB. List_Ego_pos_temp is a temporary list which stores all the positions of the agent.

Also, let's say there are ten states, 1 to 10, and we also know with what speed and over what distance the agent moves in each state to reach state 10. The agent can only move sequentially, i.e., from s1 to s2 to s3 and so on up to s10, not from s1 straight to s4 or s10. Let's say s8 is the goal state with reward = 10, s10 is a terminal state with reward = -10, and from s1 to s7 the reward is 0. Would it be a right approach to calculate the Q table by taking the current state as state 1 and the next state as state 2, then in the next iteration the current state as state 2 and the next state as state 3, and so on? Will this calculate the Q table correctly, given that the next state is already fed in and nothing is predicted?
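
To illustrate the sequential update I am asking about, here is a rough sketch (alpha, gamma and the episode count are only illustrative values, and s9 is assumed to give reward 0):

    Q = zeros(10, 2);                          % 10 states x 2 actions
    R = [0 0 0 0 0 0 0 10 0 -10];              % reward received on entering each state
    alpha = 0.1; gamma = 0.9;
    for episode = 1:1000
        for s = 1:9                            % the agent only moves s -> s+1
            a = randi(2);                      % action choice (epsilon-greedy in practice)
            s_next = s + 1;                    % the next state is known, not predicted
            Q(s, a) = Q(s, a) + alpha * (R(s_next) + gamma * max(Q(s_next, :)) - Q(s, a));
        end
    end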

1 Answer


Since you are defining the problem in this case, many of the choices are up to you.

  1. You can define a minimum state (e.g. 0) and a maximum state (e.g. 150) and treat each step as a state, so you would have 150 possible states. Then 100 will be your goal state. Your actions will be defined as +1 (move one step) and 0 (stop). The Q matrix will then be a 150x2 matrix covering all possible states and all actions. The reward is a scalar, as you have defined it. (A rough training sketch is given after this list.)

  2. You do not need to add a new column and row, since you have the entire Q matrix defined up front.
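
For illustration only, here is a rough sketch of how such a table could be trained (the epsilon-greedy choice and the values of alpha, gamma, epsilon and the episode count are placeholders, not values you must use):

    n_states = 150; n_actions = 2;             % actions: 1 = move forward one step, 2 = stop
    Q = zeros(n_states, n_actions);
    alpha = 0.1; gamma = 0.95; epsilon = 0.1;
    for episode = 1:5000
        s = randi(99);                         % random start somewhere before the goal at 100
        done = false;
        while ~done
            if rand < epsilon
                a = randi(n_actions);          % explore
            else
                [~, a] = max(Q(s, :));         % exploit
            end
            s_next = s + (a == 1);             % "forward" steps ahead by one, "stop" stays put
            if s_next == 100
                r = 100;  done = true;         % reached the goal state
            elseif a == 2
                r = 0;    done = true;         % stopped before the goal
            else
                r = 0;
            end
            target = r;                        % no bootstrapping from terminal states
            if ~done
                target = target + gamma * max(Q(s_next, :));
            end
            Q(s, a) = Q(s, a) + alpha * (target - Q(s, a));
            s = s_next;
        end
    end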

Best of luck.

shunyo
  • Thank you. Let's say in the same problem there are three states: s1 = A_position, s2 = B_position - 10 meters, and s3 = B_position. The agent can go from s1 to s2 to get reward = 100, from s2 to s3 for reward = -100, and from s1 to s3 for reward = -500. How can I control how far the agent travels ahead when the action "Go ahead" is taken? Should that be modified in the environment creation engine? – Arshad Pasha Aug 02 '18 at 15:12
  • No, I am talking about making it a discrete approach. So have states from 0 to 150, with A initialized randomly before 100. Then have 100 as the goal state. The action is going forward one step. Then you can code it up. – shunyo Aug 02 '18 at 21:20
  • Thank you @shunyo. What is the approach to define the states? Would it be a matrix, as we define for a grid-world problem, or another way, if I have to define the states considering the distance and the speed of the agent? – Arshad Pasha Aug 03 '18 at 21:41
  • It is a 1-D problem. So it will be just a line (x-axis if you will) of points from 0-150. Then the action will be going one step up or staying there. – shunyo Aug 05 '18 at 19:48
  • Also if this solves your problem, please do accept the answer. – shunyo Aug 05 '18 at 19:48
  • I tried it for 1000 iterations; the Q matrix gets updated with values but does not behave the required way. I used the Q-matrix update now shown in the question above. The possible actions are either 1 (brake) or 2 (accelerate), chosen with probabilities 0.1 and 0.9. – Arshad Pasha Aug 07 '18 at 09:15
  • Can you edit the question and update it with the Q-matrix calculation? I cannot figure out what is written here. – shunyo Aug 07 '18 at 17:30
  • Looking at your edited question, it still seems you have not understood basic Q-learning. So I would advise you to work through the basic grid-world example (possibly using Andrew Ng's video lectures) and then try to tackle this problem. – shunyo Aug 10 '18 at 18:46
  • Thanks, I just figured out how it could be solved. I found that the program always entered a reward loop, and I corrected that. Since I already had the possible states, I created a counter which navigates through those states, monitors the corresponding actions, and assigns specific rewards, and then also gets the next states. So all parameters needed to compute a Q value are obtained. Anyway, I appreciate your advice. – Arshad Pasha Aug 12 '18 at 13:58