I am trying to train a neural network to efficiently explore a grid to locate an object using Keras and Keras-RL. Every "step", the agent chooses a direction to explore by selecting a number from 0 to 8, where each number corresponds to a cardinal or intercardinal direction.
(Using reinforcement learning for this simple task is clearly not the best choice, as a simple algorithm could easily scan back and forth to achieve the goal. However, this serves more as a "tech demo" and challenge to myself.)
The following diagram represents all possible choices: 0 indicates northwest, 1 indicates north, 2 indicates northeast, and so on. Note that 4 represents the choice to stay stationary.
0 1 2
3 4 5
6 7 8
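For concreteness, here is a minimal sketch of how that action index could be decoded into a movement on the grid; the ACTION_DELTAS table, the move() helper, and the clamping to a 20x20 grid are illustrative names/assumptions of mine, not my actual environment code:

# Hypothetical decoding of the 0-8 action index into (row, col) offsets,
# laid out exactly like the diagram above (row offset -1 means "north")
ACTION_DELTAS = [
    (-1, -1), (-1, 0), (-1, 1),   # 0: NW, 1: N,    2: NE
    ( 0, -1), ( 0, 0), ( 0, 1),   # 3: W,  4: stay, 5: E
    ( 1, -1), ( 1, 0), ( 1, 1),   # 6: SW, 7: S,    8: SE
]

def move(pos, action, size=20):
    """Apply an action to a (row, col) position, clamped to the grid."""
    dr, dc = ACTION_DELTAS[action]
    return (min(max(pos[0] + dr, 0), size - 1),
            min(max(pos[1] + dc, 0), size - 1))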
The observation function returns the explored/unexplored state of each tile within a certain radius of "vision" (flattened with .flatten()), and the reward function simply returns the number of unexplored grid tiles within this radius.
In the following diagram, which uses a radius of 2, █ represents an explored tile, ■ represents a tile within the radius of vision, o represents the explorer, x represents the desired object, and a blank space represents an entirely unexplored tile.
+--------+
|████████|
|██■■■█ |
|██■o■ |
| ■■■ |
| x|
+--------+
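For reference, a rough sketch of the observation/reward logic described above might look like the following; the helper names and the decision to pad past the grid edge with "explored" (so the flattened observation length stays fixed) are assumptions of mine rather than my exact implementation:

import numpy as np

def visible_window(explored, pos, radius=2):
    # Fixed-size (2*radius+1) x (2*radius+1) window around the explorer,
    # padded with "explored" beyond the grid edge so the size never changes
    padded = np.pad(explored, radius, constant_values=True)
    r, c = pos[0] + radius, pos[1] + radius
    return padded[r - radius:r + radius + 1, c - radius:c + radius + 1]

def observe(explored, pos, radius=2):
    # Flattened explored/unexplored flags of the tiles within the vision radius
    return visible_window(explored, pos, radius).astype(np.float32).flatten()

def reward(explored, pos, radius=2):
    # Number of still-unexplored tiles within the vision radius
    return int(np.sum(~visible_window(explored, pos, radius)))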
I am using the following model. During my experimentation, I typically use a 20x20 grid with varying numbers of 16-node Dense layers (arbitrarily chosen).
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential()
# observation_shape is the shape of the flattened tile array returned by the observation function
model.add(LSTM(2, input_shape=(1,) + observation_shape))
# nb_dense and dense_output are varied manually, for testing purposes
for _ in range(nb_dense):
    model.add(Dense(dense_output))
model.add(Dense(nb_actions))  # output shape = 9 (number of directions)
model.add(Activation("linear"))
print(model.summary())
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory
from keras.optimizers import Adam

memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy(eps=.1)
dqn = DQNAgent(
    model=model,
    nb_actions=nb_actions,
    memory=memory,
    nb_steps_warmup=10,
    target_model_update=1e-2,
    policy=policy
)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)
dqn.test(env, nb_episodes=5, visualize=True)
Unfortunately, even after much testing, the agent is still unable to find the object in any reasonable amount of time, if at all.
- Is there something inherently wrong with my layer setup? (Should I use more/fewer Dense or LSTM layers, etc.?)
- Are my SequentialMemory, EpsGreedyQPolicy, DQNAgent, or .compile() values non-ideal for the situation?
- Is exploration itself too complex a problem for such a simple network to solve?
In general, how could I improve the network so that it would actually succeed in exploration, finding the object in a relatively short amount of time regardless of its placement?