
I have been training a reinforcement learning agent to play ultimate tic-tac-toe (an expanded version of tic-tac-toe played on a 9x9 board, i.e. nine 3x3 sub-boards, with additional rules).

I've created an OpenAI Gym environment and have been trying to train the agent using Stable-Baselines3's PPO and DQN. However, the agent keeps choosing the same action in every state, even though that action is invalid most of the time.

I think the problem is caused by my environment, since I have already tried tweaking the training hyperparameters and switching the type of training network, and I have also tried changing the reward values in the environment, but haven't seen any improvement.

This is the constructor for my environment:

def __init__(self):
    super(UltimateTicTacToeEnv, self).__init__()

    self.reset()

    self.action_space = Discrete(81)  # 9 boards * 9 squares = 81 actions

    # 81 board squares + pointer + current_player
    self.observation_space = Box(low=0, high=2, shape=(83,), dtype=np.int8)
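
reset() isn't shown above; it just rebuilds the board and returns the initial observation. A rough sketch of what it boils down to (the no-argument Board() constructor and the starting player here are simplifications, not my exact code):

def reset(self):
    # Fresh game: empty board, player 1 to move
    self.board = Board()
    self.current_player = 1
    return self.get_state()  # length-83 observation matching observation_space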

This is the step method. Board is another class that handles move validation, the current board state, and modifications to the board:

def step(self, action):
    reward = 0
    # The action is in [0, 80]; decode it into the sub-board and the square within it
    board = action // 9
    square = action % 9

    self.board.update()

    if self.board.isValid(board, square):  # checks whether the move is valid

        reward += 1  # small reward for a valid move

        self.board.addValue(self.current_player, board, square)  # adds the move to the board

        self.board.update()  # updates the board with the action

        if Board.hasWon(self.board.values[board]) == self.current_player:  # checks if the player won the mini 3x3 board the action was played in
            reward += 1  # reward for winning that mini 3x3 board

        done, winner = self.check_game_over(board, square)  # checks if the game is over, and who won if it is
        if done:
            if winner == self.current_player:
                reward += 5  # reward for winning the game

        self.current_player = 3 - self.current_player  # switch players
    else:
        reward -= 1  # penalty for taking an invalid action
        done = False

    # get_state() returns a numpy array of length 83: the 81 board squares, the pointer
    # to the sub-board the next move must be played in, and the current player
    return self.get_state(), reward, done, {}
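
For context, get_state() flattens the board and appends the pointer and the current player, roughly like this (the Board attribute names here are illustrative, not my real ones):

def get_state(self):
    # 81 cell values (0 = empty, 1/2 = the two players), then the sub-board pointer, then the player to move
    cells = np.array(self.board.values, dtype=np.int8).flatten()  # shape (81,)
    extras = np.array([self.board.pointer, self.current_player], dtype=np.int8)
    return np.concatenate([cells, extras])  # shape (83,)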

This is the code I'm using to train PPO:

policy_kwargs = dict(
    net_arch=dict(pi=[83, 256, 256, 256, 81], vf=[83, 256, 256, 256, 81]),
)

model = PPO("MlpPolicy", env, verbose=1, learning_rate=2.5e-3, n_steps=2048, batch_size=64, 
            n_epochs=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, ent_coef=0.005, policy_kwargs=policy_kwargs, device="cuda")

And this is the code I'm using to train DQN:

policy_kwargs = dict(
    net_arch=[83, 256, 256, 256, 81],
)
model = DQN("MlpPolicy", env, verbose=1, learning_rate=2.5e-3, policy_kwargs=policy_kwargs, device='cuda')
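
After training, the repeated action shows up in an ordinary rollout loop like this sketch (my env uses the old 4-tuple Gym step API shown above):

obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    print(action)  # prints the same action index on every step
    obs, reward, done, info = env.step(action)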

Any suggestions on what might be causing the agent to pick the same action for every state, and on how to fix this?

Samatva K

1 Answer


"And this is the code I'm using to train DQN:"

Well, a possible explanation is that you do not actually train at all. You must call the .learn method, passing the desired total_timesteps, to train your freshly instantiated model; see the examples in the Stable-Baselines3 documentation.
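
For example, something like this (the total_timesteps budget is just a placeholder):

model = DQN("MlpPolicy", env, verbose=1, learning_rate=2.5e-3, policy_kwargs=policy_kwargs, device="cuda")
model.learn(total_timesteps=500_000)  # without this call the policy keeps its random initial weights
model.save("dqn_ultimate_ttt")        # optional: save the trained model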

gehirndienst
  • I did actually call the `.learn` method with sufficient `total_timesteps`; I just didn't show it in my question. Thanks for the thought, though. – Samatva K May 05 '23 at 16:48