
I'm working on a project where I want to train an agent to find optimal routes in a road network (graph). I built a custom Env with OpenAI Gym, and I'm building the model and training the agent with Keras and Keras-rl respectively.

The problem is that pretty much every example I found about Deep Q-Learning with Keras uses a fixed set of possible actions, but in my case the number of possible actions changes from node to node. For example: at the start node you might have only 2 neighbouring nodes to move to, but later you might be at a node with 4 possible nodes to go to.

I saw that one approach to this is to penalize the impossible steps with a negative reward, but that doesn't sound optimal.
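(For context, that penalty approach would look roughly like the sketch below inside the Env's step method; self.neighbors, self.current_node, self._get_obs() and the -100 value are hypothetical placeholders, and the classic 4-tuple step API is assumed.)

def step(self, action):
    # If the chosen action is not a neighbor of the current node,
    # stay in place and return a large negative reward.
    if action not in self.neighbors[self.current_node]:
        return self._get_obs(), -100.0, False, {}
    # ... otherwise move to the chosen neighbor as usual ...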

I found out that you can use space.Discrete().sample(mask) to act as a filter for the possible actions. The mask is an np.array, e.g. np.array([1, 1, 0, 0, 0, 0, 0, 0, 0]), where 1 means the corresponding action is possible and 0 means it isn't. This works when I test my custom Env, and I don't have to redeclare the action space.
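For reference, a minimal sketch of what that masked sampling looks like (assuming a Gym version where Discrete.sample accepts a mask of dtype np.int8; the neighbor list is just a placeholder):

import numpy as np
from gym import spaces

action_space = spaces.Discrete(10)

# Placeholder: indices of the nodes reachable from the current node
current_neighbors = [0, 1]

# 1 = action allowed, 0 = not allowed (Gym expects the mask as np.int8)
mask = np.zeros(action_space.n, dtype=np.int8)
mask[current_neighbors] = 1

# Sample only among the allowed actions
action = action_space.sample(mask=mask)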

But how do I implement this in the agent training process? The agent always picks one of the 10 possible actions (because that's the nb_actions parameter passed to DQNAgent()), which sometimes results in an IndexError: list index out of range, since the possible steps are stored in a list of the current node's neighbors.

Here is some of the code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

def build_model(env):
    model = Sequential()
    input_shape = (1, env.observation_space.shape[0])  # this results in (1, 8)
    model.add(Flatten(input_shape=input_shape))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    n_output_nodes = env.action_space.n  # one linear Q-value output per action
    model.add(Dense(n_output_nodes, activation='linear'))
    return model


from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy

def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(
        model=model,
        memory=memory,
        policy=policy,
        nb_actions=actions,  # fixed to the full size of the action space
        nb_steps_warmup=10,
        target_model_update=1e-2,
    )
    return dqn

The model and the agent are built as follows:

from tensorflow.keras.optimizers import Adam

model = build_model(env)
dqn = build_agent(model, env.action_space.n)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)

1 Answer


Your issue can be resolved by setting the logits of your neural network that fall outside the mask to -inf; the softmax then effectively assigns those actions zero probability in the network's output.
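For illustration, a small NumPy-only sketch of the -inf variant (this is not keras-rl code, it just shows the masking math):

import numpy as np

q_values = np.array([-2.0, 1.0, 0.0, 3.0])
mask = np.array([0, 1, 1, 0])

# Replace masked-out entries with -inf; exp(-inf) == 0, so the softmax
# assigns those actions zero probability.
masked = np.where(mask == 1, q_values, -np.inf)
probs = np.exp(masked - masked.max())
probs /= probs.sum()

print(probs)  # approximately [0.    0.731 0.269 0.   ]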

But since you are using a linear activation function on the output layer, you can also simply set the Q-values of the unavailable actions at the current step to zero and leave the values of the nodes your mask allows unchanged. The agent's choice is then driven by the Q-values of the actions that are available at this training step.

Here's an example of how you can achieve this:

import numpy as np

# Q-values for four actions
logits = np.array([-2.0, 1.0, 0.0, 3.0])
# 1 = action available at this step, 0 = unavailable
mask = np.array([0, 1, 1, 0])

# Zero out the Q-values of unavailable actions using the mask
logits *= mask

print(logits)

The output will be:

[-0.  1.  0.  0.]

In this example, the logits array represents the Q-values of the different actions, and the mask array indicates which actions are available (1) or unavailable (0). Element-wise multiplying the logits with the mask sets the unavailable logits to zero.

As a result, the unavailable actions are effectively removed from consideration, and only the available actions contribute meaningful values to the output.
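To hook this into the keras-rl training loop, one option is to subclass the policy and apply the mask to the Q-values before an action is sampled. The sketch below is untested against your Env and assumes a hypothetical get_mask callable that returns the mask for the current node (1 = valid move, 0 = invalid):

import numpy as np
from rl.policy import BoltzmannQPolicy

class MaskedBoltzmannQPolicy(BoltzmannQPolicy):
    def __init__(self, get_mask, **kwargs):
        super().__init__(**kwargs)
        # get_mask is a hypothetical callable supplied by you, e.g. a method
        # on your custom Env that returns the current node's action mask.
        self.get_mask = get_mask

    def select_action(self, q_values):
        mask = self.get_mask()
        # Push invalid actions towards -inf so they are effectively never
        # sampled; BoltzmannQPolicy clips the values internally, so this
        # stays numerically safe.
        masked_q = np.where(mask == 1, q_values, -np.inf)
        return super().select_action(masked_q)

You would then pass policy=MaskedBoltzmannQPolicy(get_mask=...) when building the DQNAgent instead of the plain BoltzmannQPolicy; how the mask is exposed from the Env (an attribute, a method, or the info dict returned by step) is up to you.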
