I'm working on a project where I want to train an agent to find optimal routes in a road network (graph). I built the custom Env with OpenAI Gym, and I'm building the model with Keras and training the agent with Keras-rl.
The problem is that pretty much every example I've found about Deep Q-Learning with Keras uses a fixed set of possible actions, but in my case the number of possible actions changes from node to node. For example: at the start node you might have 2 neighboring nodes available as steps, but later you might be in a node that has 4 possible nodes to go to.
I saw that one approach to this is to mark the impossible steps with a negative reward, but that doesn't sound optimal.
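For reference, this is roughly what that penalty approach would look like inside the Env's step() method. This is only a minimal sketch; RouteEnv, self.graph, self.current_node, self.goal_node, _get_obs() and _route_reward() are hypothetical names standing in for whatever the custom Env actually uses:

import gym

class RouteEnv(gym.Env):  # hypothetical name for the custom Env
    ...
    def step(self, action):
        neighbors = list(self.graph.neighbors(self.current_node))
        if action >= len(neighbors):
            # Impossible step: stay on the current node and punish the agent.
            return self._get_obs(), -10.0, False, {}
        # Valid step: move to the chosen neighbor and compute the real reward.
        self.current_node = neighbors[action]
        done = self.current_node == self.goal_node
        return self._get_obs(), self._route_reward(), done, {}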
I found out that you can use spaces.Discrete().sample(mask) to act as a filter of possible actions. The mask is an np.array([1,1,0,0,0,0,0,0,0]), where 1 means the corresponding action is possible and 0 means it isn't. This works when testing my custom Env, and I don't have to redeclare the action space.
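For example, this is a small sketch of how I'm using it (as far as I can tell, newer Gym releases expect the mask as an int8 array with one entry per action):

import numpy as np
from gym import spaces

action_space = spaces.Discrete(10)
# Only the first two actions (the current node's two neighbors) are allowed.
mask = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.int8)
action = action_space.sample(mask=mask)  # only ever returns 0 or 1 here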
But how do I bring this into the agent's training process? The agent always picks one of the 10 possible actions (because that's the nb_actions parameter passed to DQNAgent()), which sometimes results in an IndexError: list index out of range, since the possible steps are a list of the current node's neighbors.
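In other words, roughly this happens inside step() (simplified; the neighbor names are made up):

# The env stores the possible steps as the current node's neighbor list,
# but the agent can pick any index out of the 10 actions.
neighbors = ['node_B', 'node_C']  # e.g. only 2 reachable neighbors here
action = 7                        # index chosen by DQNAgent out of range(10)
next_node = neighbors[action]     # IndexError: list index out of range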
Here is some of the code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

def build_model(env):
    model = Sequential()
    input_shape = (1, env.observation_space.shape[0])  # this results in (1, 8)
    model.add(Flatten(input_shape=input_shape))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    n_output_nodes = env.action_space.n  # one output per action (10 in my case)
    model.add(Dense(n_output_nodes, activation='linear'))
    return model

def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(
        model=model,
        memory=memory,
        policy=policy,
        nb_actions=actions,
        nb_steps_warmup=10,
        target_model_update=1e-2,
    )
    return dqn
The model and the agent are built as follows:
model = build_model(env)
dqn = build_agent(model, env.action_space.n)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)