I have two neural networks:
- One predicts action values Q(s, a) using off-policy reinforcement learning; it approximates the best response to the opponent's average behaviour.
- The other imitates the agent's own average best-response behaviour using supervised classification.
These are my models (Keras):
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

# the best response network:
def _build_best_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    # Q-value estimates for the 3 possible actions
    out = Dense(3, activation='relu')(hidden)
    model = Model(inputs=input_, outputs=out, name="br-model")
    model.compile(loss='mean_squared_error', optimizer=Adam(lr=self.lr_br), metrics=['accuracy'])
    return model

# Average response network:
def _build_avg_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    # probability distribution over the 3 possible actions
    out = Dense(3, activation='softmax')(hidden)
    model = Model(inputs=input_, outputs=out, name="ar-model")
    model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.lr_ar), metrics=['accuracy'])
    return model
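For context, this is roughly how I combine the two networks when acting, following the paper's anticipatory mixing as I understand it. The names anticipatory_param, epsilon, br_model and avg_model are my own; this is just a sketch of my setup, not code from the paper:

import numpy as np

def act(self, state):
    # With probability anticipatory_param (eta in the paper) play the
    # epsilon-greedy best response, otherwise sample from the average policy.
    state = np.asarray(state).reshape(1, -1)
    if np.random.rand() < self.anticipatory_param:
        if np.random.rand() < self.epsilon:
            return np.random.randint(3)              # explore
        q_values = self.br_model.predict(state)[0]
        return int(np.argmax(q_values))              # greedy best response
    probs = self.avg_model.predict(state)[0]
    return int(np.random.choice(3, p=probs))         # sample from average policy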
As stated in Heinrich and Silver's paper "Deep Reinforcement Learning from Self-Play in Imperfect-Information Games", the networks have to be updated as follows: the best-response network is trained to minimise the mean squared error between its Q-value predictions and the Q-learning targets (computed with a target network), while the average-policy network is trained to minimise the cross-entropy (negative log-likelihood) of the agent's own past best-response actions stored in a reservoir buffer.
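This is roughly how I currently apply those two updates. The buffer handling, target_br_model (a separate target network), gamma and the batch layout are my own names and a sketch of my code, not code from the paper:

import numpy as np

def train_networks(self, rl_batch, sl_batch):
    # --- Q-learning update of the best-response network (MSE loss) ---
    # rl_batch: transitions (s, a, r, s', done) from the circular RL memory
    states, actions, rewards, next_states, dones = rl_batch
    q_targets = self.br_model.predict(states)            # current estimates
    q_next = self.target_br_model.predict(next_states)   # target network
    for i in range(len(states)):
        target = rewards[i]
        if not dones[i]:
            target += self.gamma * np.max(q_next[i])
        q_targets[i, actions[i]] = target
    self.br_model.train_on_batch(states, q_targets)

    # --- supervised update of the average-policy network (cross-entropy) ---
    # sl_batch: (state, best-response action) pairs from the reservoir memory
    sl_states, sl_actions = sl_batch
    one_hot_actions = np.eye(3)[sl_actions]               # 3 discrete actions
    self.avg_model.train_on_batch(sl_states, one_hot_actions)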
I'm not sure if I have implemented it right. I'm sure categorical_crossentropy and mean_squared_error are the right loss functions, but I'm not sure whether softmax and relu are the right activation functions to use.
As stated in the paper:
For learning in Leduc Hold’em, we manually calibrated NFSP for a fully connected neural network with 1 hidden layer of 64 neurons and rectified linear activations.
They use relu as the activation function, but I guess this refers to the best-response network, because it doesn't make sense to use relu in supervised classification, where I want a probability distribution over the possible actions.
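To make my doubt concrete, this is what I suspect the output layers should look like: relu only in the hidden layers, a linear output for the unbounded Q-values, and softmax for the policy. The state shape and hidden size here are just placeholders, and the linear output is my guess, not something the paper states explicitly:

from keras.layers import Input, Dense
from keras.models import Model

state_dim = (30,)   # placeholder state shape, just for illustration
n_hidden = 64       # hidden size mentioned in the paper for Leduc Hold'em

# Best-response network: relu hidden layer, linear output for Q-values
s_br = Input(shape=state_dim)
h_br = Dense(n_hidden, activation='relu')(s_br)
q_out = Dense(3, activation='linear')(h_br)          # instead of relu?
br_model = Model(inputs=s_br, outputs=q_out)

# Average-policy network: softmax output -> probability distribution
s_avg = Input(shape=state_dim)
h_avg = Dense(n_hidden, activation='relu')(s_avg)
pi_out = Dense(3, activation='softmax')(h_avg)
avg_model = Model(inputs=s_avg, outputs=pi_out)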
I'm not able to reproduce the experiments in the paper and just want to be sure that the networks are set up properly.
Cheers