I want to train an RL agent with DQN using only a fixed set of samples, without any interaction with the environment. In my understanding, DQN is an off-policy algorithm, so this should be possible. (Am I right?) However, I have failed to train it so far. Specifically, the argmax action is the same for every state, whereas in an optimal policy it should differ between states.
My environment is as follows:
- State: 4 states (`A`, `B`, `C`, `D`)
- Action: 3 actions (`Stay`, `Up`, `Down`)
- Reward & Transition: `B` is the terminal state. (The expression in parentheses means (state, action, reward, next state); see the code sketch after this list.)
  - When you `Stay` in `A`, you will be in `A` and get 0 (`A`, `Stay`, 0, `A`)
  - When you `Up` in `A`, you will be in `B` and get 0.33 (`A`, `Up`, 0.33, `B`)
  - When you `Down` in `A`, you will be in `A` and get 0 (`A`, `Down`, 0, `A`)
  - When you `Stay` in `B`, you will be in `B` and get 0.33 (`B`, `Stay`, 0.33, `B`)
  - When you `Up` in `B`, you will be in `C` and get 0.25 (`B`, `Up`, 0.25, `C`)
  - When you `Down` in `B`, you will be in `A` and get 0 (`B`, `Down`, 0, `A`)
  - When you `Stay` in `C`, you will be in `C` and get 0.25 (`C`, `Stay`, 0.25, `C`)
  - When you `Up` in `C`, you will be in `D` and get 0.2 (`C`, `Up`, 0.2, `D`)
  - When you `Down` in `C`, you will be in `B` and get 0.33 (`C`, `Down`, 0.33, `B`)
  - When you `Stay` in `D`, you will be in `D` and get 0.2 (`D`, `Stay`, 0.2, `D`)
  - When you `Up` in `D`, you will be in `D` and get 0.2 (`D`, `Up`, 0.2, `D`)
  - When you `Down` in `D`, you will be in `C` and get 0.25 (`D`, `Down`, 0.25, `C`)
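For reference, here is one way the transitions above could be written down in code. The integer indices (`A`=0 … `D`=3, `Stay`=0, `Up`=1, `Down`=2) and the `done` flag, which I set whenever the next state is the terminal state `B`, are my own conventions for illustration, not part of the original setup.

```python
# One possible encoding of the MDP above as (state, action, reward, next_state, done) tuples.
# The index choices and the done flag for the terminal state B are assumptions.
A, B, C, D = 0, 1, 2, 3
STAY, UP, DOWN = 0, 1, 2

transitions = [
    (A, STAY, 0.00, A, False),
    (A, UP,   0.33, B, True),   # reaches terminal B
    (A, DOWN, 0.00, A, False),
    (B, STAY, 0.33, B, True),
    (B, UP,   0.25, C, False),
    (B, DOWN, 0.00, A, False),
    (C, STAY, 0.25, C, False),
    (C, UP,   0.20, D, False),
    (C, DOWN, 0.33, B, True),   # reaches terminal B
    (D, STAY, 0.20, D, False),
    (D, UP,   0.20, D, False),
    (D, DOWN, 0.25, C, False),
]
```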
How I trained:
- I put every sample above into the replay buffer.
- Then I use DQN to train on mini-batches drawn from that buffer. (No interaction with the environment; see the sketch below.)
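For concreteness, here is a minimal sketch of a single DQN update on a mini-batch drawn from that fixed buffer. It assumes PyTorch, a one-hot state encoding, a discount factor `gamma = 0.9`, and no separate target network; none of these details are stated in my post, so they may not match my actual code.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.9, n_states=4):
    """One DQN gradient step on a mini-batch of (s, a, r, s2, done) tuples.

    Assumptions (not stated in the post): one-hot state encoding, gamma = 0.9,
    and no separate target network.
    """
    s, a, r, s2, done = zip(*batch)
    states      = F.one_hot(torch.tensor(s),  num_classes=n_states).float()
    next_states = F.one_hot(torch.tensor(s2), num_classes=n_states).float()
    actions = torch.tensor(a).unsqueeze(1)                    # shape (batch, 1)
    rewards = torch.tensor(r, dtype=torch.float32)
    dones   = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(states).gather(1, actions).squeeze(1)        # Q(s, a) of the taken actions
    with torch.no_grad():
        max_next_q = q_net(next_states).max(dim=1).values     # max_a' Q(s', a')
    target = rewards + gamma * (1.0 - dones) * max_next_q     # no bootstrap past terminal B

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```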
Misc.
- Neural network
- Two layers (input and output layers only, with no hidden layer between them); see the sketch after this list.
- Optimizer: Adam
- Hyperparameters
- learning rate: 0.001
- batch size: varying between 2 and 12
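Putting the pieces from this section together, a rough sketch of the network, optimizer, and training loop. It reuses `transitions` and `dqn_update` from the sketches above; the batch size and the number of updates are hypothetical choices, and the one-hot input encoding is again an assumption.

```python
import random
import torch
import torch.nn as nn

# "Two layers, no hidden layer" amounts to a single linear map from the
# 4-dimensional one-hot state to 3 Q-values (the one-hot encoding is an assumption).
q_net = nn.Linear(4, 3)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.001)

batch_size = 8        # hypothetical value within the 2-12 range mentioned above
n_updates = 5000      # hypothetical; the post does not say how long training runs

for _ in range(n_updates):
    # `transitions` and `dqn_update` come from the earlier sketches
    batch = random.sample(transitions, batch_size)
    dqn_update(q_net, optimizer, batch)
```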
Code Screenshots
Result
- Result screenshot
- The columns are the actions (0: `Stay`, 1: `Up`, 2: `Down`) and the rows are the states. (Some of the values are different and some are the same.)
- The argmax for every state is 1 (`Up`), which is not an optimal policy.
- Even if I run the training loop longer, the result does not change. (A small snippet for reproducing this check follows.)
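Continuing from the sketch above (so `q_net` is the trained network), this is roughly how the table can be reproduced and the greedy action per state checked:

```python
import torch

# Q-values per state: rows = states A..D as one-hot inputs, columns = Stay/Up/Down.
with torch.no_grad():
    q_table = q_net(torch.eye(4))
print(q_table)                  # the full Q-table
print(q_table.argmax(dim=1))    # greedy action index per state
```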