I have two neural networks:
- One predicts action values Q(s, a) using off-policy reinforcement learning; it approximates the best response to the opponent's average behaviour.
- The other imitates the agent's own average best-response behaviour using supervised classification.
These are my models (Keras):
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

# the best response network:
def _build_best_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    # Q-value estimates for the 3 possible actions
    out = Dense(3, activation='relu')(hidden)
    model = Model(inputs=input_, outputs=out, name="br-model")
    model.compile(loss='mean_squared_error', optimizer=Adam(lr=self.lr_br), metrics=['accuracy'])
    return model

# Average response network:
def _build_avg_response_model(self):
    input_ = Input(shape=self.s_dim, name='input')
    hidden = Dense(self.n_hidden, activation='relu')(input_)
    # probability distribution over the 3 possible actions
    out = Dense(3, activation='softmax')(hidden)
    model = Model(inputs=input_, outputs=out, name="ar-model")
    model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.lr_ar), metrics=['accuracy'])
    return model
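For context, this is roughly how I combine the two networks when acting, following the paper's anticipatory mixing as I understand it. The names anticipatory_param, epsilon, br_model and avg_model are my own; this is just a sketch of my setup, not code from the paper:

import numpy as np

def act(self, state):
    # With probability anticipatory_param (eta in the paper) play the
    # epsilon-greedy best response, otherwise sample from the average policy.
    state = np.asarray(state).reshape(1, -1)
    if np.random.rand() < self.anticipatory_param:
        if np.random.rand() < self.epsilon:
            return np.random.randint(3)              # explore
        q_values = self.br_model.predict(state)[0]
        return int(np.argmax(q_values))              # greedy best response
    probs = self.avg_model.predict(state)[0]
    return int(np.random.choice(3, p=probs))         # sample from average policy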
As stated in Heinrich and Silver's paper "Deep Reinforcement Learning from Self-Play in Imperfect-Information Games", the networks have to be updated as follows: the best-response network is trained to minimise the mean squared error between its Q-value predictions and the Q-learning targets (computed with a target network), while the average-policy network is trained to minimise the cross-entropy (negative log-likelihood) of the agent's own past best-response actions stored in a reservoir buffer.
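This is roughly how I currently apply those two updates. The buffer handling, target_br_model (a separate target network), gamma and the batch layout are my own names and a sketch of my code, not code from the paper:

import numpy as np

def train_networks(self, rl_batch, sl_batch):
    # --- Q-learning update of the best-response network (MSE loss) ---
    # rl_batch: transitions (s, a, r, s', done) from the circular RL memory
    states, actions, rewards, next_states, dones = rl_batch
    q_targets = self.br_model.predict(states)            # current estimates
    q_next = self.target_br_model.predict(next_states)   # target network
    for i in range(len(states)):
        target = rewards[i]
        if not dones[i]:
            target += self.gamma * np.max(q_next[i])
        q_targets[i, actions[i]] = target
    self.br_model.train_on_batch(states, q_targets)

    # --- supervised update of the average-policy network (cross-entropy) ---
    # sl_batch: (state, best-response action) pairs from the reservoir memory
    sl_states, sl_actions = sl_batch
    one_hot_actions = np.eye(3)[sl_actions]               # 3 discrete actions
    self.avg_model.train_on_batch(sl_states, one_hot_actions)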
I'm not sure if I have implemented it right. I'm sure categorical_crossentropy and mean_squared_error are the right loss functions, but I'm not sure whether softmax and relu are the right activation functions to use.
As stated in the paper:
For learning in Leduc Hold’em, we manually calibrated NFSP for a fully connected neural network with 1 hidden layer of 64 neurons and rectified linear activations.
They use relu as the activation function, but I guess this refers to the best-response network, because it doesn't make sense to use relu in supervised classification, where I want a probability distribution over the possible actions.
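To make my doubt concrete, this is what I suspect the output layers should look like: relu only in the hidden layers, a linear output for the unbounded Q-values, and softmax for the policy. The state shape and hidden size here are just placeholders, and the linear output is my guess, not something the paper states explicitly:

from keras.layers import Input, Dense
from keras.models import Model

state_dim = (30,)   # placeholder state shape, just for illustration
n_hidden = 64       # hidden size mentioned in the paper for Leduc Hold'em

# Best-response network: relu hidden layer, linear output for Q-values
s_br = Input(shape=state_dim)
h_br = Dense(n_hidden, activation='relu')(s_br)
q_out = Dense(3, activation='linear')(h_br)          # instead of relu?
br_model = Model(inputs=s_br, outputs=q_out)

# Average-policy network: softmax output -> probability distribution
s_avg = Input(shape=state_dim)
h_avg = Dense(n_hidden, activation='relu')(s_avg)
pi_out = Dense(3, activation='softmax')(h_avg)
avg_model = Model(inputs=s_avg, outputs=pi_out)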
I'm not able to reproduce the experiments in the paper and just want to be sure that the networks are set up properly.
Cheers