
I have a question about the input and output (layer) of a DQN.

For example:

Two points: P1(x1, y1) and P2(x2, y2)

P1 has to walk towards P2

I have the following information:

  • Current position of P1 (x/y)
  • Current position of P2 (x/y)
  • Distance from P1 to P2 (x/y)
  • Direction from P1 to P2 (x/y)

P1 has 4 possible actions:

  • Up
  • Down
  • Left
  • Right

How do I have to set up the input and output layers?

  • 4 input nodes
  • 4 output nodes

Is that correct? What do I have to do with the output? I get 4 arrays with 4 values each as output. Is applying argmax to the output correct?

Edit:

Input / State:

import math
import numpy as np

# Current position P1
state_pos = [x_POS, y_POS]
state_pos = np.asarray(state_pos, dtype=np.float32)
# Current position P2
state_wp = [wp_x, wp_y]
state_wp = np.asarray(state_wp, dtype=np.float32)
# Distance P1 - P2 
state_dist_wp = [wp_x - x_POS, wp_y - y_POS]
state_dist_wp = np.asarray(state_dist_wp, dtype=np.float32)
# Direction P1 - P2
distance = [wp_x - x_POS, wp_y - y_POS]
norm = math.sqrt(distance[0] ** 2 + distance[1] ** 2)
state_direction_wp = [distance[0] / norm, distance[1] / norm]
state_direction_wp = np.asarray(state_direction_wp, dtype=np.float32)
state = [state_pos, state_wp, state_dist_wp, state_direction_wp]
state = np.array(state)
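A quick shape check (a sketch with hypothetical coordinate values) shows why this stacked state is 2-D rather than a flat vector:

```python
import math
import numpy as np

# Hypothetical coordinates for illustration
x_POS, y_POS = 1.0, 2.0
wp_x, wp_y = 4.0, 6.0

state_pos = np.asarray([x_POS, y_POS], dtype=np.float32)
state_wp = np.asarray([wp_x, wp_y], dtype=np.float32)
state_dist_wp = np.asarray([wp_x - x_POS, wp_y - y_POS], dtype=np.float32)
norm = math.sqrt((wp_x - x_POS) ** 2 + (wp_y - y_POS) ** 2)
state_direction_wp = np.asarray(
    [(wp_x - x_POS) / norm, (wp_y - y_POS) / norm], dtype=np.float32
)

state = np.array([state_pos, state_wp, state_dist_wp, state_direction_wp])
print(state.shape)  # (4, 2) -- four 2-element rows, not a flat vector
```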

Network:

import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def __init__(self):
    self.q_net = self._build_dqn_model()
    self.epsilon = 1

def _build_dqn_model(self):
    q_net = Sequential()
    q_net.add(Dense(4, input_shape=(4,2), activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net

def random_policy(self, state):
    return np.random.randint(0, 4)

def collect_policy(self, state):
    if np.random.random() < self.epsilon:
        return self.random_policy(state)
    return self.policy(state)

def policy(self, state):
    # Here I get 4 arrays with 4 values each as output
    action_q = self.q_net(state)
Tailor

2 Answers


Adding input_shape=(4,2) to the first Dense layer causes the output shape to be (None, 4, 4). Defining q_net as follows solves it:

from tensorflow.keras.layers import Reshape  # in addition to Dense

q_net = Sequential()
q_net.add(Reshape(target_shape=(8,), input_shape=(4,2)))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
rms = tf.optimizers.RMSprop(learning_rate=1e-4)
q_net.compile(optimizer=rms, loss='mse')
return q_net

Here, q_net.add(Reshape(target_shape=(8,), input_shape=(4,2))) reshapes the (None, 4, 2) input to (None, 8) [Here, None represents the batch shape].

To verify, print q_net.output_shape and it should be (None, 4) [Whereas in the previous case it was (None, 4, 4)].

You also need to do one more thing. Recall that input_shape does not take the batch dimension into account: input_shape=(4,2) expects inputs of shape (batch_shape, 4, 2). Verify this by printing q_net.input_shape; it should output (None, 4, 2). So you have to add a batch dimension to your input. You can simply do the following:

state_with_batch_dim = np.expand_dims(state,0)

And pass state_with_batch_dim to q_net as input. For example, you can call the policy method you wrote like policy(np.expand_dims(state,0)) and get an output of dimension (batch_shape, 4) [in this case (1,4)].
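The batch-dimension fix can be verified with plain NumPy, no model required:

```python
import numpy as np

# A state shaped like the one built in the question
state = np.zeros((4, 2), dtype=np.float32)

# Prepend the batch dimension expected by q_net
state_with_batch_dim = np.expand_dims(state, 0)
print(state_with_batch_dim.shape)  # (1, 4, 2)
```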

And here are the answers to your initial questions:

  1. Your output layer should have 4 nodes (units).
  2. Your first dense layer does not necessarily have to have 4 nodes (units). If you consider the Reshape layer, the notion of nodes or units does not fit there. You can think of the Reshape layer as a placeholder that takes a tensor of shape (None, 4, 2) and outputs a reshaped tensor of shape (None, 8).
  3. Now you should get outputs of shape (None, 4), where the 4 values are the q-values of the 4 corresponding actions. The q-values are the outputs themselves; argmax is only needed afterwards to pick the greedy action, not to obtain the q-values.
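Selecting an action from the (batch, 4) output is then a single argmax per row; a NumPy sketch with hypothetical q-values:

```python
import numpy as np

# Hypothetical q-values for a batch of one state, shape (1, 4)
action_q = np.array([[0.1, 0.7, -0.2, 0.3]], dtype=np.float32)

# Greedy action: index of the largest q-value in the first (only) row
action = int(np.argmax(action_q[0]))
print(action)  # 1
```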

It could make sense to also feed the DQN some information on the direction it is currently facing. You could set the state up as (Current Pos X, Current Pos Y, X From Goal, Y From Goal, Direction).

The output layer should just be (Up, Left, Down, Right) in an order you determine. Taking the argmax of the output is suitable for this problem. The exact code depends on whether you are using TensorFlow or PyTorch.
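With the action ordering fixed up front (the ordering here is a hypothetical choice), the argmax index maps directly to a move; a framework-agnostic sketch:

```python
import numpy as np

# Fixed action ordering -- any order works as long as it never changes
ACTIONS = ["Up", "Left", "Down", "Right"]

# Hypothetical network output for one state
q_values = np.array([0.2, -0.1, 0.9, 0.4], dtype=np.float32)

best = ACTIONS[int(np.argmax(q_values))]
print(best)  # Down
```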

AverageHomosapien
  • Thanks for your answer. I'm using TF. I don't understand the output I get. 4 arrays because of 4 output nodes and 4 possible actions, right? But why do I get 4 values in each array? – Tailor Dec 01 '20 at 08:46
  • What is the shape of the Neural Network you are using? – AverageHomosapien Dec 01 '20 at 08:48
  • 1 Input layer with 4 nodes, 2 dense layers with 128 nodes each and 1 output layer with 4 nodes – Tailor Dec 01 '20 at 08:52
  • I'm struggling to understand why you're getting that output layer, I'm sorry to say. I've mostly used Pytorch. – AverageHomosapien Dec 01 '20 at 09:18
  • No problem. Normally with 4 output a.k.a 4 actions I would get 4 q-values, right? – Tailor Dec 01 '20 at 09:44
  • Yeah you should get a value for each action and then be able to Argmax and select the best action. – AverageHomosapien Dec 01 '20 at 10:03
  • Mhm, maybe because of my Input. The input looks like this: [[P1_x, P1_y], [P2_x, P2_y], [Dist_x, Dist_y], [Dir_x, Dir_y]]. – Tailor Dec 01 '20 at 10:10
  • Can you edit the post to include the code of the whole network? I'll have a look and see. Perhaps it is the input, seems strange to structure it like that. – AverageHomosapien Dec 01 '20 at 10:11
  • Is it okay like that? – Tailor Dec 01 '20 at 11:19
  • I'd try and avoid 2D inputs, especially in this format. I'd recommend keeping the number of input values as low as possible (you don't need 2 distance calculations passed, and I'd say that even the P2 location may be redundant, depending on the task). I wouldn't be surprised if you're getting weird outputs because of the 2D input. – AverageHomosapien Dec 01 '20 at 16:13
  • Okay, so having an input like this: [P1_x, P1_y, P2_x, P2_y, Dist_x, Dist_y, Dir_x, Dir_y], could solve the problem? I will try it! – Tailor Dec 01 '20 at 16:28
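The flat input layout proposed in the last comment could be built like this (hypothetical coordinate values; the state has shape (8,) instead of (4, 2)):

```python
import math
import numpy as np

# Hypothetical positions for illustration
x_POS, y_POS, wp_x, wp_y = 1.0, 2.0, 4.0, 6.0

dist_x, dist_y = wp_x - x_POS, wp_y - y_POS
norm = math.sqrt(dist_x ** 2 + dist_y ** 2)

# One flat vector: [P1_x, P1_y, P2_x, P2_y, Dist_x, Dist_y, Dir_x, Dir_y]
state = np.asarray(
    [x_POS, y_POS, wp_x, wp_y, dist_x, dist_y, dist_x / norm, dist_y / norm],
    dtype=np.float32,
)
print(state.shape)  # (8,) -- matches a first layer declared with input_shape=(8,)
```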