I am working on four functions to implement an MDP (Markov decision process) in Python. I need help understanding how I would calculate the next-state value V(s'). I know the equation Q(s, a) = R(s, a) + gamma * sum over s' of (P(s' | s, a) * V(s')), and for value iteration the new V(s) is the max Q-value over the four actions, with the policy updated to the action that achieves it.
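My understanding is that V(s') is simply the state value currently stored at the neighboring cell from the previous sweep. To sanity-check the arithmetic with the constants from my code below (the neighbor values 1.0, 0.5, and 0.0 are made up purely for illustration):

Q(s, East) = ACTION_REWARD + GAMMA * (0.8 * 1.0 + 0.1 * 0.5 + 0.1 * 0.0)
           = -0.1 + 0.9 * 0.85
           = 0.665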
Here is what the grid graphics look like:
This is the instruction:
Here is the code implementation in Python.
from cell import states
import pygame
import drawfn
ACTION_EAST=0
ACTION_SOUTH=1
ACTION_WEST=2
ACTION_NORTH=3
TRANSITION_SUCCEED=0.8 #The probability that taking action A moves the agent to the expected destination state S' (the state that action A aims to move to).
TRANSITION_FAIL=0.2 #The probability that taking action A moves the agent to an unexpected destination state. For example, taking action East may move the agent to the perpendicular direction North or South. We assume those two directions evenly split TRANSITION_FAIL, so each has probability 0.1.
GAMMA=0.9 #the discount factor
ACTION_REWARD=-0.1 #The instantaneous reward for taking each action (we assume the four actions (N/E/W/S) all have the same reward)
CONVERGENCE=0.0000001 #The threshold on the largest value change, used to decide when to stop iterating
cur_convergence=100
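#For example, the full outcome distribution for ACTION_EAST is:
#  P(move East)  = TRANSITION_SUCCEED  = 0.8
#  P(move North) = TRANSITION_FAIL / 2 = 0.1
#  P(move South) = TRANSITION_FAIL / 2 = 0.1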
#####Implement the below functions ############################
#Make sure the agent "bounces back" (i.e., stays in its current cell) if an
#action points off the grid or into a gray (blocked) box.

#Expected destination offset (dx, dy) for each action, kept from my original
#attempt: East=(1,0), South=(0,-1), West=(-1,0), North=(0,1).
transition = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, -1),
              ACTION_WEST: (-1, 0), ACTION_NORTH: (0, 1)}

#The two perpendicular directions an action can slip to, each with
#probability TRANSITION_FAIL / 2 = 0.1.
perpendicular = {ACTION_EAST: (ACTION_NORTH, ACTION_SOUTH),
                 ACTION_WEST: (ACTION_NORTH, ACTION_SOUTH),
                 ACTION_NORTH: (ACTION_EAST, ACTION_WEST),
                 ACTION_SOUTH: (ACTION_EAST, ACTION_WEST)}

def nextStateValue(s, direction):
    #Return V(s') for moving one cell from s in the given direction.
    #The attributes s.x, s.y (grid coordinates), s.state_value (current value
    #estimate), and s.is_wall (gray-box flag) are assumptions about the cell
    #class; rename them to match the actual cell module.
    dx, dy = transition[direction]
    nx, ny = s.x + dx, s.y + dy
    if ny < 0 or ny >= len(states) or nx < 0 or nx >= len(states[0]):
        return s.state_value #bounced off the grid edge: stay in place
    neighbor = states[ny][nx]
    if neighbor.is_wall:
        return s.state_value #bounced off a gray box: stay in place
    return neighbor.state_value

def computeQValue(s, action):
    #s is the cell/state to update; action is 0-east, 1-south, 2-west, 3-north.
    #Q(s,a) = R + GAMMA * (0.8*V(expected s') + 0.1*V(slip 1) + 0.1*V(slip 2)).
    #Note V(s') is the neighboring cell's state value, not one of s's own
    #q_values as in my first attempt, and each slip gets TRANSITION_FAIL/2.
    side1, side2 = perpendicular[action]
    expected = (TRANSITION_SUCCEED * nextStateValue(s, action)
                + (TRANSITION_FAIL / 2) * nextStateValue(s, side1)
                + (TRANSITION_FAIL / 2) * nextStateValue(s, side2))
    s.q_values[action] = ACTION_REWARD + GAMMA * expected
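On top of computeQValue, here is a minimal sketch of the value-iteration loop I have in mind: each sweep refreshes all four Q-values per cell, takes the max as the new V(s), and records the argmax action as the policy, stopping once the largest single-cell change falls below CONVERGENCE. The s.policy field, like the s.state_value and s.is_wall attributes above, is an assumption about the cell class:

def valueIteration():
    global cur_convergence
    cur_convergence = 100
    while cur_convergence > CONVERGENCE:
        cur_convergence = 0
        for row in states:
            for s in row:
                if s.is_wall:
                    continue #gray boxes hold no value (assumed flag)
                for a in range(4):
                    computeQValue(s, a) #refresh Q(s,a) for all four actions
                best = max(s.q_values)
                #track the largest single-cell change in this sweep
                cur_convergence = max(cur_convergence, abs(best - s.state_value))
                s.state_value = best #V(s) = max over the four Q-values
                s.policy = s.q_values.index(best) #greedy policy: argmax action

This updates values in place (Gauss-Seidel style), which still converges. Terminal reward cells, if the grid has them, would need to keep their fixed values rather than being updated, but my code does not show how those are represented.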
Any help would be greatly appreciated. Thank you! Feel free to ask for more detail if needed.