
I have 2 neural networks:

  1. Predicts action values Q(s, a) using off-policy reinforcement learning - approximates the best response to the opponent's average behaviour.
  2. Imitates its own average best-response behaviour using supervised classification.

These are my models (Keras):

from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

# The best response network: outputs a Q-value for each of the 3 actions
def _build_best_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    out = Dense(3, activation='relu')(hidden)

    model = Model(inputs=input_, outputs=out, name="br-model")
    model.compile(loss='mean_squared_error', optimizer=Adam(lr=self.lr_br), metrics=['accuracy'])
    return model

# The average response network: outputs a probability distribution over the 3 actions
def _build_avg_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    out = Dense(3, activation='softmax')(hidden)

    model = Model(inputs=input_, outputs=out, name="ar-model")
    model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.lr_ar), metrics=['accuracy'])
    return model
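
For context, this is roughly how I expect the two networks to be combined during play - just a minimal sketch, where eta (the anticipatory parameter), epsilon (the exploration rate), br_model and ar_model are illustrative attribute names of my own:

import numpy as np

def _act(self, state):
    # NFSP mixes the two strategies: with probability eta play the
    # (epsilon-greedy) best response, otherwise sample from the average policy.
    state = state.reshape(1, -1)
    if np.random.rand() < self.eta:
        if np.random.rand() < self.epsilon:
            return np.random.randint(3)              # explore
        q_values = self.br_model.predict(state)[0]
        return int(np.argmax(q_values))              # greedy best response
    probs = self.ar_model.predict(state)[0].astype(np.float64)
    probs /= probs.sum()                             # guard against float32 rounding
    return int(np.random.choice(3, p=probs))         # sample from average policy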

As stated in Heinrich and Silver's paper "Deep Reinforcement Learning from Self-Play in Imperfect-Information Games", the networks have to be updated as follows:

The average-policy network is updated by gradient descent on the log loss

$$\mathcal{L}(\theta^{\Pi}) = \mathbb{E}_{(s,a)\sim\mathcal{M}_{SL}}\left[-\log \Pi(s,a\mid\theta^{\Pi})\right]$$

and the best-response network on the mean squared error of the Q-learning targets

$$\mathcal{L}(\theta^{Q}) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{M}_{RL}}\left[\left(r + \max_{a'} Q(s',a'\mid\theta^{Q'}) - Q(s,a\mid\theta^{Q})\right)^{2}\right]$$

where $\mathcal{M}_{SL}$ and $\mathcal{M}_{RL}$ are the supervised and reinforcement learning memories, and $\theta^{Q'}$ are the periodically updated target network parameters.
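
To map those updates onto Keras, this is roughly the training step I have in mind - again only a sketch, where target_br_model (a periodically copied target network), rl_batch and sl_batch are illustrative names:

import numpy as np

def _train_step(self, rl_batch, sl_batch):
    # RL update: regress Q(s, a) towards r + max_a' Q(s', a') from the
    # target network, i.e. the mean squared error loss above.
    s, a, r, s_next, done = rl_batch
    q = self.br_model.predict(s)
    q_next = self.target_br_model.predict(s_next)
    q[np.arange(len(a)), a] = r + (1.0 - done) * np.max(q_next, axis=1)
    self.br_model.train_on_batch(s, q)

    # SL update: imitate the agent's own past best-response actions,
    # i.e. the log loss above (actions are one-hot encoded).
    states, actions = sl_batch
    self.ar_model.train_on_batch(states, actions)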

I'm not sure if I have implemented it right. I'm sure categorical_crossentropy and mean_squared_error are the right loss functions, but I'm not sure whether softmax and relu are the right activation functions to use.

As stated in the paper:

For learning in Leduc Hold’em, we manually calibrated NFSP for a fully connected neural network with 1 hidden layer of 64 neurons and rectified linear activations.

They use relu as the activation function, but I guess they are referring to the best response network, because it doesn't make sense to use relu in supervised classification where I want a probability distribution over the possible actions.
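
Just to illustrate my point with plain numpy: relu outputs are non-negative but unnormalised, while softmax gives a valid distribution:

import numpy as np

logits = np.array([2.0, -1.0, 0.5])
relu_out = np.maximum(logits, 0.0)                   # [2.0, 0.0, 0.5]
softmax_out = np.exp(logits) / np.exp(logits).sum()
print(relu_out.sum())     # 2.5  - not a probability distribution
print(softmax_out.sum())  # ~1.0 - a distribution over the 3 actions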

I'm not able to reproduce the experiments in the paper and just want to be sure that the networks are set up properly.

Cheers
