
I'm trying to use a neurosymbolic approach to solve the FrozenLake environment, together with Stable Baselines3. I applied the TransformReward wrapper to the environment, and it seems to be working (the reward values do change).
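
For context, the wrapper on its own behaves roughly like this; the lambda here is only a placeholder constant shift, not my actual shaping logic:

import gymnasium as gym
from gymnasium.wrappers import TransformReward

plain_env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=False)
# Every reward returned by step() is passed through the given function
shaped_env = TransformReward(plain_env, lambda r: r - 0.01)  # placeholder shift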

Here is how the program works:

It calculates a reward per step based on the distance of the next state to the goal state (a rough sketch of that distance term is shown after the reward list below). I also tried adding some more constraints, like penalizing the agent if it stays on the same square or if it falls into a hole. The thing is, I don't know whether I'm doing something wrong, so any help would be much appreciated. Here is part of the code; I'll omit the neurosymbolic part because it's irrelevant.

The rewards are:

  • Taking a step in a direction that brings you nearer to the goal: less than 1 (it depends on how close to the goal you are)
  • Stepping into a hole: -1
  • Reaching the goal: 2
  • Not moving (taking an action that keeps you on the same square, like "pushing" against a wall): -1

I have tried using only the first one (taking a step towards the goal) on its own, and also in combination with the other ones.
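
Just to illustrate the kind of distance term I mean, here is a hypothetical Manhattan-distance helper for the 4x4 map; it is not the actual Scallopy code, which I omit below:

# Hypothetical helper, only to illustrate the distance-based step reward
NROWS, NCOLS = 4, 4
GOAL_ROW, GOAL_COL = NROWS - 1, NCOLS - 1

def distance_step_reward(next_state):
    # FrozenLake states are flat indices; recover (row, col)
    row, col = divmod(next_state, NCOLS)
    # Manhattan distance from the next state to the goal square
    dist = abs(GOAL_ROW - row) + abs(GOAL_COL - col)
    max_dist = (NROWS - 1) + (NCOLS - 1)
    # Closer to the goal -> value approaches 1; farther -> approaches 0
    return 1.0 - dist / max_dist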

Here's the code:

import scallopy
import gymnasium as gym

from operator import add
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.callbacks import CallbackList, EvalCallback, CheckpointCallback
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3 import A2C
from stable_baselines3.common.monitor import Monitor
from gymnasium.wrappers import TransformReward

CHECKPOINT_DIR = '/home/joaquin/TFM/NeuroRL/train/neuroA2C/FrozenLake-v1'
LOG_DIR = '/home/joaquin/TFM/NeuroRL/logs/neuroA2C/FrozenLake-v1'

total_timesteps = 5000000

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=False)

state, _ = env.reset()

# Map layout as flattened bytes (b'S', b'F', b'H', b'G'), indexed by the flat state
map_layout = env.unwrapped.desc
map_layout = map_layout.flatten()
# env = Monitor(env)

# Environment data:
nrows = 4
ncols = 4
finish_square = (nrows-1)*ncols + (ncols-1)  # flat index of the goal square

def transform(number):
    # Wrap a plain integer into the (value, tuple) fact format expected by Scallopy
    trans_num = [(number, (i,)) for i in range(1)]
    return trans_num

scallop_finish = transform(finish_square)

def symbolic_reward(env):

    global state

    # Predict the next action with the current policy and step the environment
    action, _ = model.predict(state)
    action = int(action)
    next_state, reward, done, _, _ = env.step(action)

    scallop_next_state = transform(next_state)

    # Scallopy:
    # Omitted the Scallopy part; it produces a value in prob[0] that is
    # divided by 1000 to obtain the distance-based reward.
    exp = int(prob[0])
    neuro_reward = exp / 1000

    # Check if it steps into a hole:
    if map_layout[next_state] == b'H':
        neuro_reward = -1

    # Check if it reaches the goal:
    if map_layout[next_state] == b'G':
        neuro_reward = 2

    # Check if it stays on the same square:
    if state == next_state:
        neuro_reward = -1

    env = TransformReward(env, lambda r: neuro_reward)
    state = next_state

    return state
    
class symCallback(BaseCallback):

    def __init__(self, verbose=1):
        super().__init__(verbose)

    def _on_step(self):
        symbolic_reward(env)
        # BaseCallback expects a bool: True keeps training, False stops it
        return True
              
checkpoint_callback = CheckpointCallback(save_freq=50000, save_path=CHECKPOINT_DIR, 
                                          save_replay_buffer=True, save_vecnormalize=True)

# eval_callback = EvalCallback(env, best_model_save_path=CHECKPOINT_DIR, log_path=LOG_DIR, 
#                               eval_freq=5000, deterministic=True, render=False, verbose=1)

# callback = CallbackList([checkpoint_callback, eval_callback])
callback = CallbackList([checkpoint_callback, symCallback()])  
  
model = A2C('MlpPolicy', env, verbose=1)
# model = A2C.load('/home/joaquin/TFM/NeuroRL/train/neuroA2C/FrozenLake-v1/rl_model_4350000_steps', env)  # Uncomment to continue training
model.learn(total_timesteps=total_timesteps, callback=callback, progress_bar=True)
model.save('neuroA2C_SymbolicFrozenLake-v1')   

env.close()

Should I normalize the "taking a step in the right direction" rewards? Because right now I feel that this scheme also rewards taking the shortest path more heavily.
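
By "normalize" I mean something like clamping the per-step term into a fixed range; a hypothetical sketch, not part of the code above:

def normalized_step_reward(exp, max_exp=1000):
    # Hypothetical: clamp the Scallopy score into [0, 1] so every
    # per-step reward lies in a known, bounded range
    return max(0.0, min(1.0, exp / max_exp))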

  • " I don't know if I'm doing something wrong" This is not a programming question, open ended questions are not well suited for Stack Overflow. You are basically not asking anything. – Dr. Snoopy Aug 09 '23 at 19:02
  • @Dr.Snoopy I'm asking why is not working. It's a reward shaping related question, which involves programming. – Joaquin Aug 09 '23 at 19:16
  • Involving programming does not mean it is a programming question, for example the reason why it does not work could be theoretical, as I said, this question is not a good fit for this site. Also consider the open ended-ness is very problematic. Stack Overflow is about questions that can be answered. – Dr. Snoopy Aug 09 '23 at 19:17
