
I have built a Reinforcement Learning DQN with variable-length sequences as inputs, and positive and negative rewards calculated for actions. Some problem with my DQN model in Keras means that although the model runs, average rewards over time decrease, over both single and multiple cycles of epsilon. This does not change even after a significant period of training.

[Plot: single cycle of epsilon, average rewards decreasing]

[Plot: multiple cycles of epsilon, average rewards decreasing]

My thinking is that this is due to using MeanSquaredError in Keras as the loss function (minimising error), so I am trying to implement gradient ascent instead (to maximise reward). How can I do this in Keras? My current model is:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
inp = (env.NUM_TIMEPERIODS, env.NUM_FEATURES)
model.add(Input(shape=inp))  # a shape tuple (integers), not including the batch size
model.add(Masking(mask_value=0., input_shape=inp))

model.add(LSTM(env.NUM_FEATURES, input_shape=inp, return_sequences=True))
model.add(LSTM(env.NUM_FEATURES))
model.add(Dense(env.NUM_FEATURES))
model.add(Dense(4))  # one output per action

model.compile(loss='mse',
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
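
For context on where the loss fits in, a standard DQN is trained by regressing the predicted Q-values towards bootstrapped targets, so the reward signal enters through the targets rather than through the sign of the loss. A rough sketch of one such training step, assuming hypothetical names (replay_buffer, target_model, BATCH_SIZE, GAMMA) that are not in the question:

import numpy as np

# Sample a batch of transitions from a hypothetical replay buffer.
states, actions, rewards, next_states, dones = replay_buffer.sample(BATCH_SIZE)

q_current = model.predict(states)           # shape (BATCH_SIZE, 4)
q_next = target_model.predict(next_states)  # target network: a delayed copy of model

# Bellman targets: only the taken action's Q-value is updated.
targets = q_current.copy()
targets[np.arange(BATCH_SIZE), actions] = (
    rewards + GAMMA * np.max(q_next, axis=1) * (1.0 - dones)
)

# Minimising MSE pulls the predicted Q-values towards these targets.
model.fit(states, targets, epochs=1, verbose=0)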

In trying to implement gradient ascent by 'flipping' the gradient (as a negative or inverse loss?), I have tried various loss definitions:

loss=-'mse'    
loss=-tf.keras.losses.MeanSquaredError()    
loss=1/tf.keras.losses.MeanSquaredError()

but these all generate "bad operand type for unary" errors.
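
The unary minus in these attempts is applied to a Python string or to a Loss object, and neither type supports negation, so the error is raised before any training starts. A minimal reproduction (my own snippet, not from the question):

import tensorflow as tf

# Neither a string nor a Loss instance defines the unary minus operator.
try:
    loss = -tf.keras.losses.MeanSquaredError()
except TypeError as err:
    print(err)  # e.g. "bad operand type for unary -: 'MeanSquaredError'"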

How can I adapt the current Keras model to maximise rewards? Or is gradient ascent not even the problem? Could it be some issue with the action policy?

MarkD

1 Answer


Writing a custom loss function

Here is the loss function you want:

@tf.function
def positive_mse(y_true, y_pred):
    # Negating the MSE means that minimising this loss with the optimiser
    # maximises the original MSE, i.e. gradient descent becomes ascent.
    return -1 * tf.keras.losses.MSE(y_true, y_pred)
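
As a quick sanity check (my own sketch, not part of the original answer), you can evaluate the custom loss on dummy tensors and confirm it is just the negated MSE:

import tensorflow as tf

y_true = tf.constant([[1.0, 2.0, 3.0, 4.0]])
y_pred = tf.constant([[1.5, 2.0, 2.0, 4.0]])

print(tf.keras.losses.MSE(y_true, y_pred).numpy())  # 0.3125
print(positive_mse(y_true, y_pred).numpy())         # -0.3125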

And then your compile line becomes

model.compile(loss=positive_mse,
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])

Please note: use loss=positive_mse and not loss=positive_mse(). That's not a typo; you need to pass the function itself, not the result of executing the function.
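
To illustrate why (my own note, reusing the optimizer settings from the question): Keras stores the callable and invokes it as positive_mse(y_true, y_pred) on every batch during training, so calling it yourself at compile time fails because those arguments do not exist yet.

# Correct: pass the function object; Keras calls it per batch.
model.compile(loss=positive_mse,
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])

# Incorrect: positive_mse() raises a TypeError about the missing
# y_true and y_pred arguments before compile() is even reached.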

Anton Codes
  • Thx - Trying to implement, may take some time, but will post results when I have them. Have upticked in the meantime – MarkD Nov 23 '20 at 11:03
  • if it answers your question then please mark it as the accepted answer as per the SO guidelines and practices https://stackoverflow.com/help/someone-answers – Anton Codes Nov 23 '20 at 17:38
  • Have accepted the answer for the custom loss function. However, it is unknown whether this specific function (here -MSE) resolves the decreasing average rewards in the DQN. Will update. – MarkD Nov 25 '20 at 11:50