What I have done
I'm using the DQN algorithm from Stable Baselines 3 for a two-player board-style game. In this game, 40 moves are available, but once a move has been played, it cannot be played again.
I trained my first model against an opponent that chooses its moves randomly. If the model makes an invalid move, I give it a negative reward equal to the maximum score one can obtain and end the game.
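Concretely, the penalty looks roughly like this inside my environment's step() (a simplified sketch; MAX_SCORE, score_move and is_over are placeholders for my own constants and helpers):

def step(self, action):
    if line_exist(action, self.state):
        # Invalid move: maximum penalty and the episode ends
        return self.state, -MAX_SCORE, True, {}
    self.state[action] = 1
    # Normal move: score it and check whether the game is over (hypothetical helpers)
    return self.state, self.score_move(action), self.is_over(), {}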
The issue
Once that was done, I trained a new model against the one obtained from the first run. Unfortunately, the training process eventually gets stuck because the opponent seems to loop on an invalid move. This means that, despite everything I tried in the first training, the first model still predicts invalid moves. Here's the code for the "dumb" opponent:
while self.dumb_turn:
    # The opponent chooses a move
    chosen_line, _states = model2.predict(self.state, deterministic=True)
    # Re-predict until the move is valid (line_exist is True when the move was already played)
    while line_exist(chosen_line, self.state):
        chosen_line, _states = model2.predict(self.state, deterministic=True)
    # Once a valid move is found, register it in the state
    self.state[chosen_line] = 1
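Note that since predict is called with deterministic=True and self.state does not change inside the inner loop, the model keeps returning the same invalid move, so the loop never exits. As a stopgap (a rough sketch, assuming line_exist(move, state) returns True for moves already played and that there are 40 possible moves), I could fall back to a random valid move:

import numpy as np

chosen_line, _states = model2.predict(self.state, deterministic=True)
if line_exist(chosen_line, self.state):
    # Deterministic re-prediction would loop forever on the same move,
    # so pick a uniformly random valid move instead (stopgap, not a fix)
    valid_moves = [m for m in range(40) if not line_exist(m, self.state)]
    chosen_line = np.random.choice(valid_moves)
self.state[chosen_line] = 1

This avoids the deadlock, but it does not stop the first model from preferring invalid moves, which is why I would rather mask the Q-values directly.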
What I would like to do but don't know how
A solution would be to manually set the Q-values to -inf for the invalid moves, so that the opponent avoids those moves and the training algorithm does not get stuck. I've been told how to access these values:
import torch as th
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1")
env = model.get_env()
obs = env.reset()
with th.no_grad():
    obs_tensor, _ = model.q_net.obs_to_tensor(obs)
    q_values = model.q_net(obs_tensor)
But I don't know how to set them to -infinity.
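What I imagine is masking the Q-values at action-selection time, something like the following (an untested sketch; building the mask with my line_exist helper and taking the argmax myself are my own assumptions):

import torch as th

with th.no_grad():
    obs_tensor, _ = model.q_net.obs_to_tensor(self.state)
    q_values = model.q_net(obs_tensor)  # shape: (1, n_actions)
    # Boolean mask: True for moves that were already played
    invalid = th.tensor([line_exist(m, self.state) for m in range(q_values.shape[1])])
    # Overwrite invalid moves with -inf so argmax can never select them
    q_values[0, invalid] = -float("inf")
    chosen_line = int(q_values.argmax(dim=1).item())

As far as I understand, this would only mask the values when choosing an action; the Q-network itself is unchanged, so model.predict on its own could still return invalid moves.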
If somebody could help me, I would be very grateful.