First of all, I know that Dropout is not common in reinforcement learning (RL). You can read more about the topic and why it may still make sense here:

https://towardsdatascience.com/generalization-in-deep-reinforcement-learning-a14a240b155b

I am not sure how to implement Dropout in a Keras DQN. Usually (in supervised learning) Keras takes care of turning the Dropout layers on/off depending on whether you are training or testing. In my case (trading with RL) I train on TRAIN data and test on holdout data, and they are NOT equal. The model overfits the TRAIN data and does not generalize well. I can see that it overfits just by looking at the train results: it memorizes and perfectly trades the data it was trained on. That's why I want to use Dropout.


EDIT: Since "krenerd" suggested a different way to implement this, I'll summarize all 4 ways known to me here:

WAY 1: Using K.set_learning_phase(), i.e. K.set_learning_phase(1) to keep Dropout active and K.set_learning_phase(0) to disable it.
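Roughly how I understand WAY 1 would be used (untested sketch on my side; note that K.set_learning_phase() is deprecated in newer TF 2.x releases), with model and state as in the example below:

from tensorflow.keras import backend as K

K.set_learning_phase(1)                       # phase 1: Dropout layers stay active
preds_with_dropout = model.predict(state)

K.set_learning_phase(0)                       # phase 0: Dropout layers are bypassed
preds_without_dropout = model.predict(state)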

WAY 2: (https://stackoverflow.com/a/57439143/11122466)

Using a K (backend) function:

func = K.function(model.inputs + [K.learning_phase()], model.outputs)

run the model with dropout layers being active, i.e. learning_phase=1

preds = func(list_of_input_arrays + [1])

run the model with dropout layers being inactive, i.e. learning_phase=0

preds = func(list_of_input_arrays + [0])

WAY 3: (https://stackoverflow.com/a/57439143/11122466) "Another approach is to define a new model with the same architecture but without setting training=True, and then transfer the weights from the trained model to this new model." This is very slow for me, around 1.5 ms per copy. Because it is slow, I don't like that solution.
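Untested sketch of how I understand WAY 3, using a functional-API twin of _build_model() below in which the Dropout layer is called with training=True so it is always active; "agent" stands for an instance of the DQNAgent class shown further down:

from tensorflow.keras import layers, models

def build_dropout_model(state_size, action_size):
    inp = layers.Input(shape=(state_size,))
    x = layers.Dense(24, activation='relu')(inp)
    x = layers.Dense(24, activation='linear')(x)
    x = layers.Dropout(0.05)(x, training=True)   # Dropout forced on in this copy
    x = layers.ReLU()(x)
    out = layers.Dense(action_size, activation='linear')(x)
    return models.Model(inp, out)

dropout_model = build_dropout_model(state_size, action_size)
dropout_model.set_weights(agent.model.get_weights())   # ~1.5 ms per copy, the slow part
preds = dropout_model.predict(state)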

WAY 4 (suggested by "krenerd"): Call the model with model(x, training=True). Here I get a ValueError: Layer INPUT was called with an input that isn't a symbolic tensor. Received type: <class 'numpy.ndarray'>. My input is a numpy array, so I have to cast it to a tensor first.
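Untested sketch of WAY 4 with that cast, assuming TF 2.x eager execution and using model and state from the example below:

import tensorflow as tf

state_tensor = tf.convert_to_tensor(state, dtype=tf.float32)
q_with_dropout = model(state_tensor, training=True).numpy()      # Dropout applied
q_without_dropout = model(state_tensor, training=False).numpy()  # Dropout off, same as model.predict(state)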


Now look at the following simple DQN example, modified from / taken from:

https://github.com/keon/deep-q-learning/blob/master/dqn.py

import random
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense, Dropout, ReLU
from keras.optimizers import Adam

class DQNAgent:

    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='linear'))
        model.add(Dropout(0.05))
        model.add(ReLU())
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',optimizer=Adam(lr=self.learning_rate))
        return model    

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

ON TRAIN DATA: We call predict() twice in the replay() function, once for "state" and once for "next_state" (to create our target/label) and we call predict() in act(). Do we enable Dropout for both predict() calls in replay()? Do we enable Dropout for the predict() call in act()?

ON TEST DATA: No exploration, epsilon = 0. Only act() is used to evaluate the performance on unseen data. The TEST data is NOT saved in the replay buffer. I think we do not use Dropout here?

2 Questions:

  • How/where would you insert the Dropout layer? Before or after the first Dense layer and/or before/after an Activation layer?

  • Which predict() calls have to be made with training = True / learning_phase = 1 and which not?

  • A few questions: what are the dimensions of your state space and action space? Where in the code do you update your target estimator, and if you do, how often? How big is your buffer size? Also, the way one should evaluate a DQN agent is by extracting the policy and evaluating that policy against another policy which you think is better. – Siddhant Tandon Dec 11 '20 at 14:37
  • Dear Siddhant, thanks for your comment. Sorry, but the "dimensions of the state/action space and buffer size" are not necessary to answer this question? I don't understand why you need this information. There is no "target estimator"; the example I have shown is a simple DQN. And it is not "my code", it is a simple EXAMPLE to explain the question of implementing DROPOUT. "the way one should evaluate a DQN agent is by extracting the policy and evaluating that policy against another policy" -> I don't understand that, as the agent is used to find a policy - because I don't have a better one? – anna12345 Dec 14 '20 at 14:44
  • Hi Anna, Actually it simply does not make sense to use dropout in a DQN task. Dropout is essentially to avoid overfitting and that too in supervised learning scenarios. The whole idea of DQN is to make Q-Learning look more like supervised learning. In DQN this is done by using a target estimator, and other DQN params that i asked you. In the world of RL there is no such train and test data. You might refer to [this answer](https://ai.stackexchange.com/questions/8293/why-do-you-not-see-dropout-layers-on-reinforcement-learning-examples) for more info. – Siddhant Tandon Dec 15 '20 at 17:22
  • What I mean about policy is that, you must have another policy to compare your DQN policy against. Lets call this your "benchmark" policy. For every state,action taken you evaluate your DQN policy against benchmark policy and see which policy gives you better rewards. Basically its policy evaluation for DQN against benchmark policy. Your benchmark could be SARSA, or even Q-learning on discretised state and action space. – Siddhant Tandon Dec 15 '20 at 17:25
  • Dear Siddhant, thanks again for your answer and for taking the time. Have you read the link from my question: towardsdatascience.com/… ? Quote: "Recently, researchers have begun to systematically explore generalization in RL by developing novel simulated environments that enable creating a distribution of MDPs and splitting unique training and testing instances." The main question is: how would you detect overfitting if you test your agent on the data you have trained it on? Answer: you need some holdout data. But if you detect it, how would you prevent the agent from overfitting? – anna12345 Dec 16 '20 at 15:13
  • Thanks for the link to the other question, I have the same problem: "created an environment that simulates currency prices and a simple agent, using DQN, that attempts to learn when to buy and sell". The agent does overfit, it perfectly trades the data it was trained on. I can see that without testing it on different data. But if I test it on unseen data, the performance goes down, which is not unexpected. With dropout I want to force it to find a more generalized policy. There is no "benchmark policy" in trading; you can use a BUY/Hold strategy, but how does that tell you whether the agent overfit? – anna12345 Dec 16 '20 at 15:20
  • "In DQN this is done by using a target estimator, and other DQN params that i asked you." I use 23 inputs and 6 actions, buffer 100000; how does that help you to answer the question? Maybe I don't get the point or we misunderstand each other? "In the world of RL there is no such train and test data." vs. "Dropout is essentially to avoid overfitting" Not in the textbook examples, that is true! But outside of the textbook world there are other fields where RL is used. If you just train on data and test/use the agent on the same data (pong/cartpole/breakout), you will never recognize if it overfits. – anna12345 Dec 16 '20 at 15:30
  • Can we end the theoretical discussion of whether "dropout makes sense here"? There are some indications that this technique could lead to success, I just want to know how to implement it correctly. It is very clear to me that this area of application is outside of the normal specifications. And I know that solving these problems is more an art than science. My aim is to find out how to implement this technically, not to discuss whether it makes sense in theory. It may not work, but I would like to test that myself. To be able to test it, however, it must be implemented correctly. – anna12345 Dec 16 '20 at 15:49
  • In the original DQN paper the state space and action space are continuous. You have a discrete state and action space. A linear function approximator would work just fine and would be much less of a pain to train. Regarding your first question, in case of MLPs dropout is usually added after every dense layer just before the non linear activations. Some discussion [here](https://stats.stackexchange.com/questions/240305/where-should-i-place-dropout-layers-in-a-neural-network). – Siddhant Tandon Dec 16 '20 at 16:55
  • "You have a discrete state and action space." My action space is discrete, my state space is continuous. That's one reason why I use an approximation. Feel free to try using a linear function in forex trading, I have serious doubts that you will be successful with it. Thanks for the hint and the link, then it fits as shown in the code above :). – anna12345 Dec 17 '20 at 07:24
  • "a linear approximator would work just fine" -- your conclusion is incorrect, some (almost all functions but the few linear ones, LOL) functions are clearly not linear. The approach is also problematic, as it tries to perform extrapolation (in this regard, I would start linear, deep NNs are terrible for that). I completely disagree with the dropout discussion though, I see pseudo-theoretical explanations regarding why it's bad all the time (increases variance, etc.), yet there are popular, peer-reviewed papers that show otherwise. It depends on the problem. – Maayao Aug 03 '23 at 10:59

1 Answer


model.predict(x) automatically disables training-only layers such as Dropout or BatchNormalization at inference time. If you want predictions with dropout applied, you must use model(x, training=True). Calling model(x, training=False) does the same thing as model.predict(x).

So for every model.predict() call used in your training, you must replace it with model(x, training=True).
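As a rough, untested sketch (not part of the original answer), the replay() method from the question could then look like this, converting the numpy inputs to tensors first (TF 2.x eager mode assumed):

import tensorflow as tf

def replay(self, batch_size):
    minibatch = random.sample(self.memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        target = reward
        if not done:
            next_q = self.model(tf.convert_to_tensor(next_state, dtype=tf.float32),
                                training=True).numpy()        # Dropout active
            target = reward + self.gamma * np.amax(next_q[0])
        target_f = self.model(tf.convert_to_tensor(state, dtype=tf.float32),
                              training=True).numpy()          # Dropout active
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay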

  • "So for all model.predict() used in your training, you must replace it to model(x,training=True)." Thanks for the answer, can you please explain why? I get different answers from different people, some say you can use the normal predict() with no Dropout here. They say that Dropout is only used while fit() is running. That's the confusing point... – anna12345 Dec 10 '20 at 14:25
  • Dear krenerd, your answer includes a lot of stuff i want to know. Why do you think that: "for all model.predict() used in training, you must replace it to model(x,training=True)". It would be very nice if you could explain that in more detail, with a plausible explanation I would accept the answer. But I want to understand what I'm doing. – anna12345 Dec 16 '20 at 16:42
  • @anna12345 Sorry for the late response, I was busy for a while. We can refer to the `model.predict()` function as an inference function, typically used at test time. Some operations such as `Dropout` or `BatchNormalization` are designed to work only during training, not when testing/evaluating. – krenerd Dec 22 '20 at 07:53
  • @anna12345 You used the `model.predict` function while training, although Keras assumes you are evaluating a trained model and therefore skips the layers that are unnecessary at inference: the dropout layer. By predicting with `model(x,training=True)`, you tell the model not to turn off the dropout layer :) – krenerd Dec 22 '20 at 07:55