I'm trying to use a neurosymbolic approach to solve the FrozenLake environment, also using Stable Baselines 3. I applied TransformReward to the environment, and it seems to be working (the reward values change).
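For context, this is roughly the kind of standalone check that shows the wrapper changing the reward values (a minimal sketch, not my actual code; the transform here is arbitrary and just for illustration):

import gymnasium as gym
from gymnasium.wrappers import TransformReward

# Minimal sanity check that TransformReward changes the reward values (illustrative transform).
check_env = TransformReward(gym.make('FrozenLake-v1', map_name="4x4", is_slippery=False),
                            lambda r: 2 * r - 0.1)
obs, info = check_env.reset()
obs, reward, terminated, truncated, info = check_env.step(check_env.action_space.sample())
print(reward)  # transformed reward, not the raw FrozenLake reward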
Here is how the program works:
It calculates a reward per step based on the distance from the next state to the goal state. I also tried adding some more constraints, such as punishing the agent if it stays on the same square or if it falls into a hole. The thing is that I don't know whether I'm doing something wrong, so if someone could help me it would be much appreciated. Here is part of the code; I'll omit the neurosymbolic part because it's irrelevant.
The rewards are:
- Taking a step in a direction that brings you closer to the goal: less than one (it depends on how close to the goal you are; see the sketch after this list)
- Stepping into a hole: -1
- Reaching the goal: 2
- Not moving (taking an action that keeps you in the same square, like "pushing" against a wall): -1
I have tried using only the first one (taking a step toward the goal) and also combinations with the other ones.
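Just to make the first reward concrete: the actual value comes from the omitted Scallopy program, but the idea is roughly the following assumed sketch based on Manhattan distance on the 4x4 grid.

# Assumed sketch of the distance-based step reward; the real value comes from the omitted Scallopy program.
# States are indexed row-major on the 4x4 map.
def distance_step_reward(next_state, nrows=4, ncols=4):
    row, col = divmod(next_state, ncols)
    dist = abs((nrows - 1) - row) + abs((ncols - 1) - col)  # Manhattan distance to the goal
    max_dist = (nrows - 1) + (ncols - 1)
    # The closer the next state is to the goal, the larger the reward, always below 1.
    return (max_dist - dist) / (max_dist + 1)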
Here's the code:
import scallopy
import gymnasium as gym
from operator import add
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.callbacks import CallbackList, EvalCallback, CheckpointCallback
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3 import A2C
from stable_baselines3.common.monitor import Monitor
from gymnasium.wrappers import TransformReward
CHECKPOINT_DIR = '/home/joaquin/TFM/NeuroRL/train/neuroA2C/FrozenLake-v1'
LOG_DIR = '/home/joaquin/TFM/NeuroRL/logs/neuroA2C/FrozenLake-v1'
total_timesteps = 5000000
env = gym.make('FrozenLake-v1', desc = None, map_name="4x4", is_slippery = False)
state = env.reset()
state = state[0]
map_layout = env.env.desc
map_layout = map_layout.flatten()
# env = Monitor(env)
# Environment data:
nrows = 4
ncols = 4
finish_square = (nrows-1)*ncols + (ncols-1)  # index of the bottom-right (goal) square, row-major
def transform(number):
    trans_num = [(number, (i,)) for i in range(1)]
    return trans_num
scallop_finish = transform(finish_square)
def symbolic_reward(env):
    global state
    exp = []
    # neuro_reward = 0
    action, _ = model.predict(state)
    action = int(action)
    next_state, reward, done, _, _ = env.step(action)
    scallop_next_state = transform(next_state)
    # Scallopy:
    # Scallopy part omitted; it produces `prob`, whose first value is divided by 1000 to get the shaping value.
    exp = int(prob[0])
    neuro_reward = exp/1000
    # Check if it steps into a hole:
    if map_layout[next_state] == b'H':
        neuro_reward = -1
    # Check if it reaches the goal:
    if map_layout[next_state] == b'G':
        neuro_reward = 2
    # Check if it stayed in the same square (did not move):
    if state == next_state:
        neuro_reward = -1
    env = TransformReward(env, lambda r: neuro_reward)
    state = next_state
    neuro_reward = 0
    return state
class symCallback(BaseCallback):
    def __init__(self, verbose=1):
        super(symCallback, self).__init__(verbose)

    def _on_step(self):
        symbolic_reward(env)
        return True  # SB3 expects _on_step to return a bool; returning False aborts training
checkpoint_callback = CheckpointCallback(save_freq=50000, save_path=CHECKPOINT_DIR,
save_replay_buffer=True, save_vecnormalize=True)
# eval_callback = EvalCallback(env, best_model_save_path=CHECKPOINT_DIR, log_path=LOG_DIR,
# eval_freq=5000, deterministic=True, render=False, verbose=1)
# callback = CallbackList([checkpoint_callback, eval_callback])
callback = CallbackList([checkpoint_callback, symCallback()])
model = A2C('MlpPolicy', env, verbose=1)
# model = A2C.load('/home/joaquin/TFM/NeuroRL/train/neuroA2C/FrozenLake-v1/rl_model_4350000_steps', env)  # Uncomment to continue training
model.learn(total_timesteps = total_timesteps, callback=callback, progress_bar=True)
model.save('neuroA2C_SymbolicFrozenLake-v1')
env.close()
Should I normalize the "step in the right direction" rewards? Because right now I feel that this way it also rewards taking the shortest path more.
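To make the question concrete, one thing I understand by "normalizing" is replacing the raw distance reward with a potential-based shaping term, something like the following assumed sketch (not what the code above does; phi and gamma are just illustrative):

# Assumed sketch of potential-based shaping as one possible meaning of "normalizing".
# phi(s) = negative Manhattan distance to the goal; gamma is the discount factor.
# With gamma = 1 these rewards telescope, so their sum over any path depends only
# on the start and end squares, not on the path taken.
def shaped_step_reward(state, next_state, gamma=0.99, nrows=4, ncols=4):
    def phi(s):
        row, col = divmod(s, ncols)
        return -(abs((nrows - 1) - row) + abs((ncols - 1) - col))
    return gamma * phi(next_state) - phi(state)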