
I'm trying to create a rudimentary Deep Q-Learning neural network for playing two-player heads-up Texas hold 'em.

The model must, for a given state, produce a probability distribution over the set of possible actions, which for this example have been simplified to fold, check, or the aggressive option (call/bet/raise). The ins and outs of how I'm planning to train the model aren't relevant for the purposes of this question.

Since the model weights will be randomly initialised, occasionally the model will attempt to perform an illegal move (such as attempting to 'check' when the opponent has raised, instead of calling or folding). When this happens, I want to punish the network and have it update its weights in order to 'filter out' illegal moves from the model's repertoire.

I'll assign the model a loss of 1.0 in this case, and am attempting to compute the gradients of the loss with respect to the model parameters. But for some reason, whenever I try to do this, the gradients always end up being "None". What does this mean, and what am I doing wrong?

I'm relatively naïve when it comes to a lot of things in TensorFlow, so it would be really helpful if you could not assume too much TensorFlow knowledge and dumb down the more advanced concepts.

Here's the code for the example I'm talking about:

from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import Sequential, Input
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np


# Creating a linear model with a simple structure. The input size represents a combination of vectors that
## will be used to create the input state.
model = Sequential()
model.add(Input(shape=(52 + 52 + 2 + 3 + 3,), batch_size=1))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(3))
# The model will produce probability values for choosing from three output actions (fold, check, call/bet/raise).
model.add(Activation('softmax'))
model.compile(loss="mse", optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# Bot hand is a 1-D boolean array of size 52. Each index represents a unique card, and a value of 1 in that index
## indicates the card is present in the hand.
bot_hand = np.zeros((52))

# For this example, we'll assume cards with ids 0 and 1 are in the bot's hand.
bot_hand[:2] = 1

# The same applies to the cards on the table; a 1-D boolean array of size 52.
table_cards = np.zeros((52))

# In this example, the bot has bet $2.
bot_bet_size = np.array([2])

# The opponent has bet $4.
opponent_bet_size = np.array([4])

# The bot's previous action is represented as a 1-D boolean (one-hot) vector of size 3. A value of 1 at an
## index indicates which action was taken. 0=fold, 1=check, 2=bet/raise.
bot_previous_action = np.zeros((3))

# The opponent's previous action is represented in the same way.
opponent_previous_action = np.zeros((3))

# In this example, the opponent's last action was to raise (to $4).
opponent_previous_action[2] = 1

# The 'state' is represented by all of the above vectors.
bot_state = np.concatenate((bot_hand, table_cards, bot_bet_size, opponent_bet_size, bot_previous_action, 
                           opponent_previous_action), axis=0)

# The action taken is determined by which of the Softmax output nodes produces the highest value.
model_output = model.predict(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
bot_decision = np.argmax(model_output)
                         
# Let's assume in this case the bot chose action '1' (check), which is an illegal move, since the opponent has
## raised. I am attempting to punish the network by assigning it a loss value of 1, and backpropagating to
## update the weights.
with tf.GradientTape() as t:
    # Creating dummy output 
    correct_output = model_output - 1
    # Ensuring that the loss is equal to 1.
    loss = tf.keras.losses.mse(correct_output, model_output)

gradients = t.gradient(loss, model.trainable_variables)

The 'gradients' variable always looks like [None, None, None, None, None, None]

I'd really appreciate some advice on how to fix this, or how to solve the problem of illegal moves in a different way.

Dylan

1 Answer


Solved. There were multiple issues contributing to this problem; I'll list them all here, along with the complete code solution.

  1. The watch() method needed to be called inside the GradientTape block, so that GradientTape was explicitly told to track gradients with respect to model.trainable_variables.
  2. The output of the model needed to be a tensor obtained by model(input), instead of the NumPy array obtained by model.predict(input).
  3. The output of the model also needed to be computed inside the GradientTape block, as opposed to outside of it.
  4. The correct_output object needed to be a tensor instead of a NumPy array.
  5. The MSE loss function is inappropriate for this model, since it has a softmax output layer and is closer to a classification network in architecture. The categorical_crossentropy loss function was used instead.

Complete code:

with tf.GradientTape(watch_accessed_variables=False) as t:
    t.watch(model.trainable_variables)
    model_output = model(bot_state.reshape(1, 52 + 52 + 2 + 3 + 3))
    # Creating a dummy target distribution - action 1 (check) is illegal, so it is given zero probability
    correct_output = tf.convert_to_tensor([[0.5, 0, 0.5]])
    # Calculating loss
    loss = tf.keras.losses.categorical_crossentropy(correct_output, model_output)

gradients = t.gradient(loss, model.trainable_variables)
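
One way to then apply these gradients and actually update the weights is with the optimizer's apply_gradients method. A minimal sketch, reusing the Adam optimizer the model was compiled with (a standalone tf.keras.optimizers.Adam instance would work the same way):

# Apply the computed gradients to the model's trainable variables.
# model.optimizer is the Adam optimizer that was passed to model.compile() above.
model.optimizer.apply_gradients(zip(gradients, model.trainable_variables))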
Dylan