
What I have done

I'm using the DQN algorithm from Stable Baselines 3 for a two-player board-type game. In this game, 40 moves are available, but once a move has been made, it can't be made again.

I trained my first model against an opponent that chooses its moves randomly. If the model makes an invalid move, I give it a negative reward equal to the maximum score one can obtain and end the game.
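
Concretely, the invalid-move handling in my environment's step() looks roughly like this (only a sketch; line_exist, MAX_SCORE and _play_move are stand-ins for my game-specific helpers and constants):

# Sketch of the invalid-move penalty inside the environment's step() method.
# line_exist() returns True when the chosen line was already played; MAX_SCORE is
# the maximum score reachable in the game; _play_move() stands for the normal game logic.
def step(self, action):
    if line_exist(action, self.state):
        return self.state, -MAX_SCORE, True, {}   # penalize and end the episode
    self.state[action] = 1
    reward, done = self._play_move(action)
    return self.state, reward, done, {}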

The issue

Once that was done, I trained a new model against the one obtained from the first run. Unfortunately, the training process eventually gets stuck because the opponent keeps looping on an invalid move, which means that, despite everything I tried in the first training, the first model still predicts invalid moves. Here's the code for the "dumb" opponent:

while(self.dumb_turn):
    # The opponent chooses a move
    chosen_line, _states = model2.predict(self.state, deterministic=True)
    # We check whether the move is valid or not
    while(line_exist(chosen_line, self.state)):
        chosen_line, _states = model2.predict(self.state, deterministic=True)
    # Once a valid move is found, we register it and add it to the state
    self.state[chosen_line] = 1

What I would like to do but don't know how

A solution would be to manually set the Q-values of the invalid moves to -inf, so that the opponent avoids those moves and the training algorithm does not get stuck. I've been told how to access these values:

import torch as th
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1")
env = model.get_env()

obs = env.reset()
with th.no_grad():
    obs_tensor, _ = model.q_net.obs_to_tensor(obs)
    q_values = model.q_net(obs_tensor)

But I don't know how to set them to -infinity.

If somebody could help me, I would be very grateful.

Lucas1283

1 Answer


I recently had a similar problem in which I needed to directly alter the q-values produced by the RL model during training in order to influence its actions.

To do this, I overrode some of the library's methods:

# Imports
import torch as th
from stable_baselines3 import DQN
from stable_baselines3.dqn.policies import QNetwork, DQNPolicy

# Override some methods of the QNetwork class used by the DQN model in order to set
# the q-values of some actions to a negative value

# Two possible methods to override:
# Override _predict ---> alters the q-values only during prediction, not during training
# Override forward  ---> alters the q-values also during training (attention: here we are working with batches of q-values)

class QNetwork_modified(QNetwork):

    def forward(self, obs: th.Tensor) -> th.Tensor:
        """
        Predict the q-values.
        :param obs: Observation
        :return: The estimated Q-Value for each action.
        """
        # Compute the q-values using the QNetwork
        q_values = self.q_net(self.extract_features(obs))
        # For each observation in the training batch:
        for i in range(obs.shape[0]):
            # Here you can alter q_values[i]
            pass

        return q_values

    
# Override the make_q_net method of the DQN policy used by the DQN model so that it uses the new Q-network

class DQNPolicy_modified(DQNPolicy):
    def make_q_net(self) -> QNetwork:
        # Make sure we always have separate networks for feature extractors etc.
        net_args = self._update_features_extractor(self.net_args, features_extractor=None)
        return QNetwork_modified(**net_args).to(self.device)



model = DQN(DQNPolicy_modified, env, verbose=1)
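
For example, if the observation is the same binary state vector as in the question (where a 1 marks a move that has already been played), the loop body can be replaced by a vectorized masking step. This is only a sketch and assumes the observation has exactly the 40-entry layout of self.state; a large negative constant is used instead of -inf, which has the same effect when taking the argmax:

# Sketch of a possible vectorized replacement for the loop in forward() above, assuming
# obs is a batch of 40-entry binary vectors where 1 = move already played
invalid_mask = obs > 0.5                               # bool tensor, shape (batch_size, n_actions)
q_values = q_values.masked_fill(invalid_mask, -1e8)    # very negative q-value for invalid moves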

Personally, I don't like this approach too much, and I would suggest first trying some "more natural" alternatives, such as also giving your model, as input, some kind of history of which actions have already been selected, to help it learn that already-selected actions should be avoided. For example, you could enrich the RL model's input with an additional binary mask in which the moves already chosen have their corresponding bit set to 1 (in this case you would have to modify the gym environment); a sketch of such an environment follows below.
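
A minimal sketch of what such an environment could look like, assuming the 40-move layout from the question (the class name, the MAX_SCORE constant and the trivial game logic are placeholders, not a real implementation):

import numpy as np
import gym
from gym import spaces

MAX_SCORE = 1  # placeholder for the game's maximum obtainable score

class BoardEnvWithMask(gym.Env):
    """Sketch: observation = board state + binary mask of already-played moves."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(40)
        # First 40 entries: board state; last 40 entries: "already played" mask
        self.observation_space = spaces.MultiBinary(2 * 40)
        self.board = np.zeros(40, dtype=np.int8)
        self.played_mask = np.zeros(40, dtype=np.int8)

    def _get_obs(self):
        return np.concatenate([self.board, self.played_mask])

    def reset(self):
        self.board[:] = 0
        self.played_mask[:] = 0
        return self._get_obs()

    def step(self, action):
        if self.played_mask[action] == 1:
            # Invalid move: penalize and end the episode, as described in the question
            return self._get_obs(), -MAX_SCORE, True, {}
        self.played_mask[action] = 1
        self.board[action] = 1          # placeholder: the real game logic goes here
        done = bool(self.played_mask.all())
        return self._get_obs(), 0.0, done, {}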

Giullar