
I want to train an RL agent with DQN from a fixed set of samples, without any interaction with the environment. In my understanding, DQN is an off-policy algorithm, so this should be possible. (Am I right?) However, I have failed to train it so far. Specifically, the argmax over actions is the same for every state, whereas it should differ between states under an optimal policy.

My environment is as follows:

  • State: 4 states (A,B,C,D)
  • Action: 3 actions (Stay, Up, Down)
  • Reward & Transition: B is the terminal state. (The expression in the parentheses means (state, action, reward, next state).)
    • When you Stay in A, you will be in A and get 0 (A, Stay, 0, A)
    • When you Up in A, you will be in B and get 0.33 (A, Up, 0.33, B)
    • When you Down in A, you will be in A and get 0 (A, Down, 0, A)
    • When you Stay in B, you will be in B and get 0.33 (B, Stay, 0.33, B)
    • When you Up in B, you will be in C and get 0.25 (B, Up, 0.25, C)
    • When you Down in B, you will be in A and get 0 (B, Down, 0, A)
    • When you Stay in C, you will be in C and get 0.25 (C, Stay, 0.25, C)
    • When you Up in C, you will be in D and get 0.2 (C, Up, 0.2, D)
    • When you Down in C, you will be in B and get 0.33 (C, Down, 0.33, B)
    • When you Stay in D, you will be in D and get 0.2 (D, Stay, 0.2, D)
    • When you Up in D, you will be in D and get 0.2 (D, Up, 0.2, D)
    • When you Down in D, you will be in C and get 0.25 (D, Down, 0.25, C)

How I trained:

  • I put every sample above into the replay memory.
  • Then I train with DQN on those samples only, with no further interaction with the environment (a sketch of this setup is under Code Screenshots below).

Misc.

  • Neural network
    • Two layers (an input layer and an output layer, with no hidden layer in between)
  • Optimizer: Adam
  • Hyperparameters
    • learning rate: 0.001
    • batch size: varying between 2 and 12

Code Screenshots
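
The screenshots are not reproduced here, but the setup is roughly the following sketch (not my exact code; the one-hot state encoding, the discount factor γ = 0.9, the `done` flag for transitions into B, and the lack of a separate target network are details I am filling in):

```python
import random
import torch
import torch.nn as nn

# States A..D as one-hot vectors of length 4, actions Stay/Up/Down as indices 0..2.
STATES = {"A": 0, "B": 1, "C": 2, "D": 3}
ACTIONS = {"Stay": 0, "Up": 1, "Down": 2}

def one_hot(name):
    v = torch.zeros(4)
    v[STATES[name]] = 1.0
    return v

# Replay memory: every (state, action, reward, next_state) tuple from the table,
# stored once up front -- there is no further interaction with the environment.
RAW = [
    ("A", "Stay", 0.00, "A"), ("A", "Up", 0.33, "B"), ("A", "Down", 0.00, "A"),
    ("B", "Stay", 0.33, "B"), ("B", "Up", 0.25, "C"), ("B", "Down", 0.00, "A"),
    ("C", "Stay", 0.25, "C"), ("C", "Up", 0.20, "D"), ("C", "Down", 0.33, "B"),
    ("D", "Stay", 0.20, "D"), ("D", "Up", 0.20, "D"), ("D", "Down", 0.25, "C"),
]
# done = True for transitions into the terminal state B (an assumption on my part).
replay = [(one_hot(s), ACTIONS[a], r, one_hot(s2), s2 == "B") for s, a, r, s2 in RAW]

# Q-network: a single linear layer (4 state inputs -> 3 action values), no hidden layer.
q_net = nn.Linear(4, 3)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.001)
gamma = 0.9       # assumed; not stated elsewhere in the question
batch_size = 8    # somewhere in the 2..12 range mentioned above

for step in range(5000):
    batch = random.sample(replay, batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s2 = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) of the taken actions
    with torch.no_grad():  # TD target; no separate target network in this sketch
        target = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Print the learned Q-table: one row per state (A..D), one column per action.
with torch.no_grad():
    print(q_net(torch.eye(4)))
```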

Result

  • Result screenshot
  • Each column is an action. (0: Stay, 1: Up, 2: Down)
  • Each row is a state. (Some rows differ from each other and some are identical.)
  • The argmax for every state is 1 (Up), which is not the optimal policy (see the value-iteration check below).
  • Even if I run the training loop for more iterations, the result does not change.
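
For reference, the exact optimal Q-values for this table can be computed with tabular value iteration, which makes the correct greedy action per state easy to check (a sketch; γ = 0.9 is assumed, and B is treated as non-terminal here because the table contains transitions out of B):

```python
# Exact Q-values by value iteration over the deterministic table in the question.
GAMMA = 0.9  # assumed; the question does not state a discount factor

# (state, action) -> (reward, next_state)
T = {
    ("A", "Stay"): (0.00, "A"), ("A", "Up"): (0.33, "B"), ("A", "Down"): (0.00, "A"),
    ("B", "Stay"): (0.33, "B"), ("B", "Up"): (0.25, "C"), ("B", "Down"): (0.00, "A"),
    ("C", "Stay"): (0.25, "C"), ("C", "Up"): (0.20, "D"), ("C", "Down"): (0.33, "B"),
    ("D", "Stay"): (0.20, "D"), ("D", "Up"): (0.20, "D"), ("D", "Down"): (0.25, "C"),
}
STATES, ACTIONS = ["A", "B", "C", "D"], ["Stay", "Up", "Down"]

Q = {sa: 0.0 for sa in T}
for _ in range(1000):  # iterate the Bellman optimality update to convergence
    Q = {sa: r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) for sa, (r, s2) in T.items()}

for s in STATES:
    greedy = max(ACTIONS, key=lambda a: Q[(s, a)])
    print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS}, "greedy:", greedy)
```

With the assumed γ = 0.9 the greedy action this prints is not the same for every state, so a Q-table whose argmax is Up everywhere cannot be the optimum.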

Comments:

  • I just want to let you know that, if you have a _theoretical_ question about RL topics, [Artificial Intelligence SE](https://ai.stackexchange.com/) is the best site to ask it. Not sure if this is a theoretical question, though. – nbro Nov 03 '20 at 14:04
  • Thank you for letting me know, @nbro :D This question is not about a theoretical thing, though. – Byungkwon Choi Nov 04 '20 at 01:04

1 Answer


Sorry, but I can't comment yet, so here are my suggestions:

  • Add another dense layer and increase the number of hidden nodes for better generalization (see the sketch after this list).
  • Your system is deterministic and has very few possibilities (and consequently very few samples to feed the replay memory), so to make it learn it could help to increase the number of epochs a lot (try 200).
  • Adding Dropout could also help, for the same reasons as above, but treat it as a supplementary step.
  • Shuffle the replay memory and pass it through the network several times.
  • Your learning rate seems very small for such a simple task.
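
For example, a minimal sketch of the first, third, and last suggestions together (the layer sizes, dropout rate, and learning rate below are only illustrative, not tuned values):

```python
import torch
import torch.nn as nn

# A slightly bigger Q-network: one hidden layer plus dropout, instead of a single
# linear layer. 4 state inputs, 32 hidden units, 3 action values.
q_net = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # supplementary regularization, as noted above
    nn.Linear(32, 3),
)

# A larger learning rate than 0.001 for this small, deterministic problem.
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.01)

# Training would then pass the (shuffled) replay memory through the network for
# many epochs, e.g. around 200, using the same DQN update as in the question.
```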

– HenDoNR