I am working on four functions that implement an MDP in Python. I need help understanding how to calculate the next-state value V(s'). I know the equation Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s'), and that in value iteration the maximum Q-value is chosen as the new state value, with the policy updated to match.
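
To check my understanding, here is that equation for a single (state, action) pair written as plain Python (just a sketch: outcomes stands for a list of (probability, next_state) pairs from the transition model, and V is a dict holding the current state values):

def q_value(reward, gamma, V, outcomes):
    # Q(s, a) = reward + gamma * sum over s' of P(s' | s, a) * V(s')
    return reward + gamma * sum(p * V[s2] for p, s2 in outcomes)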

Here is what the grid graphics look like: [grid image]. This is the instruction: [instruction image].

Here is the code implementation in Python.

from cell import states
import pygame
import drawfn

ACTION_EAST=0
ACTION_SOUTH=1
ACTION_WEST=2
ACTION_NORTH=3


TRANSITION_SUCCEED=0.8 #The probability that by taking action A, the agent moves to the intended destination state S'.
TRANSITION_FAIL=0.2 #The probability that by taking action A, the agent moves to an unintended destination state. For example, by taking action East it may move to the neighboring direction North or South, each with probability 0.1. We assume the two directions evenly split the TRANSITION_FAIL value of 0.2.
GAMMA=0.9 #the discount factor
ACTION_REWARD=-0.1 #The instantaneous reward for taking each action (we assume the four actions (N/E/W/S) have the same reward)
CONVERGENCE=0.0000001 #The convergence threshold that determines when to stop
cur_convergence=100
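
#To make the 0.8/0.2 split concrete, here is my sketch of how the three
#possible outcomes of an action could be enumerated (the helper name
#action_outcomes is my own, not part of the assignment):
def action_outcomes(action):
    left = (action + 1) % 4   #e.g. East (0) can slip to South (1)...
    right = (action + 3) % 4  #...or North (3)
    return [(TRANSITION_SUCCEED, action),
            (TRANSITION_FAIL / 2, left),
            (TRANSITION_FAIL / 2, right)]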

#####Implement the below functions ############################
#make sure the arrow will bounce back if the arrow points to the empty or the gray box
def computeQValue(s,action):
    #Computes Q(s, action) for one state s and one action
    #(0-east, 1-south, 2-west, 3-north) and stores it in s.q_values[action].
    #Does not return anything.
    #NOTE: the cell attributes used here (s.x, s.y, s.state_value, s.q_values,
    #s.is_wall) are my assumptions about the cell module; rename them to
    #match the actual cell class.

    #maps each action to the (dx, dy) offset of its intended destination
    #(keeps my original offsets; flip the dy signs if row 0 is the top row)
    transition = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, -1),
                  ACTION_WEST: (-1, 0), ACTION_NORTH: (0, 1)}

    def next_state_value(direction):
        #V(s') for the cell one step in `direction`; if that step leaves the
        #grid or lands on a gray box, the agent bounces back so s' is s itself
        dx, dy = transition[direction]
        nx, ny = s.x + dx, s.y + dy
        if 0 <= ny < len(states) and 0 <= nx < len(states[0]):
            neighbor = states[ny][nx]
            if not neighbor.is_wall:
                return neighbor.state_value
        return s.state_value  #bounce back: stay in place

    #the two slip directions are perpendicular to the intended one and
    #evenly split TRANSITION_FAIL (0.1 each)
    left = (action + 1) % 4
    right = (action + 3) % 4
    expected = (TRANSITION_SUCCEED * next_state_value(action)
                + TRANSITION_FAIL / 2 * next_state_value(left)
                + TRANSITION_FAIL / 2 * next_state_value(right))

    #Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')
    s.q_values[action] = ACTION_REWARD + GAMMA * expected

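For context, this is roughly how I expect the outer value-iteration loop to call computeQValue once it works (again a sketch; state_value and policy are the cell attributes I assumed above, and terminal/gray cells would be skipped):

def valueIteration():
    global cur_convergence
    while cur_convergence > CONVERGENCE:
        cur_convergence = 0
        for row in states:
            for s in row:
                for a in range(4):
                    computeQValue(s, a)  #fills s.q_values[a]
                best_q = max(s.q_values)
                #track the largest change in V(s) across the sweep
                cur_convergence = max(cur_convergence, abs(best_q - s.state_value))
                s.state_value = best_q  #V(s) <- max_a Q(s, a)
                s.policy = s.q_values.index(best_q)

Updating state_value in place during the sweep (instead of from a frozen copy of the old values) is Gauss-Seidel-style value iteration, which as far as I know still converges to the same values.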
Any help would be greatly appreciated. Thank you! Feel free to ask if more detail is needed.
