I am implementing policy iteration in Python for the gridworld environment as a learning exercise. I have written the following code:
    ### POLICY ITERATION ###
    def policy_iter(grid, policy):
        '''
        Perform policy iteration to find the best policy and its value
        '''
        i = 1
        while True:
            policy_converged = True  # flag to check if the policy improved; used to break out of the loop
            # evaluate the value function for the older policy
            old_v = value_eval(grid, policy)
            # improve the policy: pick the greedy action in every non-terminal state
            for s in states:
                new_a = ""
                best_v = float("-inf")
                if grid.is_terminal(s):
                    continue
                old_a = policy[s]
                for a in ACTION_SPACE:
                    # expected return of taking action a in state s under old_v
                    v = 0
                    for s2 in states:
                        env_prob = transition_probs.get((s, a, s2), 0)
                        reward = rewards.get((s, a, s2), 0)
                        v += env_prob * (reward + gamma * old_v[s2])
                    if v > best_v:
                        new_a = a
                        best_v = v
                policy[s] = new_a
                if new_a != old_a:
                    policy_converged = False
            print(i, "th iteration")
            i += 1
            if policy_converged:
                break
        return policy

This code works fine. However, when I move only the initialization of the `policy_converged` variable so that it sits outside the `while` loop,

    def policy_iter(grid, policy):
        '''
        Perform policy iteration to find the best policy and its value
        '''
        i = 1
        policy_converged = True
        while True:

and leave the rest of the code unchanged, the program goes into an infinite loop and never stops, even though I still update the flag inside the `while` loop on every iteration, depending on whether the policy changed. Why does this happen?
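
The behaviour can be reproduced without the gridworld. The following is only a minimal sketch: `run_sweeps`, `changes_per_sweep` and `max_sweeps` are hypothetical names introduced for illustration, and the list of booleans stands in for the `new_a != old_a` check in one pass over the states. It suggests that the flag is only ever set to `False` inside the loop, so unless it is reset to `True` at the start of each iteration it can never become `True` again once any policy change has been seen.

```python
def run_sweeps(changes_per_sweep, reset_inside_loop, max_sweeps=10):
    """Return how many sweeps run before the loop exits.

    changes_per_sweep is a hypothetical list of booleans: True means the
    policy changed during that sweep (a stand-in for `new_a != old_a`).
    """
    policy_converged = True  # initial value, as in the second version
    sweeps = 0
    while sweeps < max_sweeps:  # cap added so this demo always terminates
        if reset_inside_loop:
            policy_converged = True  # fresh flag for every sweep (first version)
        # Assumption: once the list is exhausted, no further changes occur.
        changed = changes_per_sweep[sweeps] if sweeps < len(changes_per_sweep) else False
        if changed:
            policy_converged = False  # the only place the flag is modified
        sweeps += 1
        if policy_converged:
            break
    return sweeps


changes = [True, True, False]  # policy changes twice, then is stable
print(run_sweeps(changes, reset_inside_loop=True))   # 3
print(run_sweeps(changes, reset_inside_loop=False))  # 10 (stopped only by the cap)
```

With the reset inside the loop, the third sweep sees no change and the loop exits. With the flag initialized once outside, the `False` from the first sweep persists, so only the artificial cap stops the loop; in the original `while True:` there is no such cap.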