
I am attempting to use Q-learning to learn minesweeping behavior on a discrete version of Mat Buckland's smart sweepers, the original of which is available here: http://www.ai-junkie.com/ann/evolved/nnt1.html, for an assignment. The assignment limits us to 50 iterations of 2000 moves on a grid that is effectively 40x40, with the mines resetting and the agent spawning in a random location each iteration.

I've attempted Q-learning with penalties for moving, rewards for sweeping mines, and penalties for not hitting a mine. The sweeper agent seems unable to learn how to sweep mines effectively within the 50 iterations, because it learns that going to a specific cell is good, but once the mine there is gone it is no longer rewarded for visiting that cell; instead it is penalized by the movement cost.

I wanted to try providing rewards only when all the mines were cleared, in an attempt to make the environment static, since there would only be two outcomes: all mines collected, or not all mines collected. However, I am struggling to implement this: with only 2000 moves per iteration and the ability to backtrack, the agent never manages to sweep all the mines within the limit, with or without rewards for collecting individual mines.

Another idea I had was to have an effectively new Q matrix for each mine, so that once a mine is collected, the sweeper transitions to the next matrix and operates on it, with the collected mine excluded from consideration.

Are there any better approaches that I can take with this, or perhaps more practical tweaks to my own approach that I can try?

A more explicit explanation of the rules:

  • The map edges wrap around, so moving off the right edge of the map will cause the bot to appear on the left edge etc.
  • The sweeper bot can move up, down, left or right from any map tile.
  • When the bot collides with a mine, the mine is considered swept and then removed.
  • The aim is for the bot to learn to sweep all mines on the map from any starting position.
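
In code, the movement rules behave roughly like the following (a simplified sketch of the dynamics rather than my actual implementation; the names are just for illustration):

```python
# Simplified sketch of the environment dynamics: a 40x40 wrapping grid where
# moving applies modular arithmetic and stepping onto a mine sweeps it.
GRID = 40
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action, mines):
    """Move one tile, wrapping at the edges; stepping onto a mine sweeps it."""
    dx, dy = ACTIONS[action]
    new_pos = ((pos[0] + dx) % GRID, (pos[1] + dy) % GRID)
    swept = new_pos in mines
    if swept:
        mines.remove(new_pos)  # mines is a mutable set of (x, y) positions
    return new_pos, swept
```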
  • By a *move* you mean the interaction with a tile, right? So either uncovering it or flagging it as a mine? Also, 50 iterations is a really tough constraint. – Timo Oct 09 '19 at 16:15
    @Timo Not quite interacting with a tile, it's not like the old minesweeper game. There is a sweeper robot that wanders around the map identifying mines; moving is literally moving, and if it collides with a mine, that mine is considered swept and removed from the map. – Rhett Flanagan Oct 09 '19 at 16:24
  • Oh I see, can you list or link the whole ruleset of the actual game? – Timo Oct 09 '19 at 16:32
  • @Timo I've updated with a list of the ruleset as given. It may be easier to understand if you look at the exercise that it was based on, which is linked near the start of my post, that used ANNs to perform the same concept in a continuous environment. – Rhett Flanagan Oct 09 '19 at 16:51
  • Do the agents have knowledge of their surrounding environment or of the mines? Otherwise this would be impossible. The article doesn't say anything about it tho. – Timo Oct 10 '19 at 01:40
  • @Timo The agents do have knowledge of the mines; they can "see" the nearest active mines. I wanted to try getting the agent to learn to move towards the nearest active mine, rather than learning to go to the positions of mines in the environment (which has problems with mines being removed), but have been unsuccessful with that so far. – Rhett Flanagan Oct 10 '19 at 06:52

1 Answer


Given that the sweeper can always see the nearest mine, this should be pretty easy. From your question I assume your only problem is finding a good reward function and representation for your agent state.

Defining a state

Absolute positions are rarely useful in a random environment, especially if the environment is effectively infinite like in your example (since the bot can drive over the borders and reappear on the other side). This means that the size of the environment isn't needed for the agent to operate (we will actually need it to simulate the wrap-around, though).

A reward function calculates its return value based on the current state of the agent compared to its previous state. But how do we define a state? Let's see what we actually need in order to operate the agent the way we want.

  1. The position of the agent.
  2. The position of the nearest mine.

That is all we need. Now, I said earlier that absolute positions are bad. This is because they make the Q table (you call it a Q matrix) static and very fragile to randomness. So let's try to completely eliminate absolute positions from the reward function and replace them with relative positions. Luckily, this is very simple in your case: instead of using the absolute positions, we use the relative position between the nearest mine and the agent.

Now we don't deal with coordinates anymore, but with vectors. Let's calculate the vector between our points: v = pos_mine - pos_agent. This vector gives us two very important pieces of information:

  1. the direction in which the nearest mine lies, and
  2. the distance to the nearest mine.

And these are all we need to make our agent operational. Therefore, an agent state can be defined as

State: Direction x Distance

where distance is a floating-point value and direction is either a float that describes the angle or a normalized vector.
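
Since a Q table needs discrete states, here is a rough sketch of how such a (direction, distance) state could be computed on your wrapping grid. The function names are just placeholders, and collapsing the direction to the sign of each axis plus the Manhattan distance is only one possible discretization:

```python
# Illustrative state computation on a wrapping 40x40 grid.
GRID = 40

def wrapped_delta(a, b):
    """Shortest signed offset from a to b on a wrapping axis of length GRID."""
    d = (b - a) % GRID
    return d - GRID if d > GRID // 2 else d

def state(agent, mine):
    dx = wrapped_delta(agent[0], mine[0])
    dy = wrapped_delta(agent[1], mine[1])
    # Discretize the direction to the sign of each component so the Q table
    # stays small; keep the Manhattan distance as the second component.
    direction = (int(dx > 0) - int(dx < 0), int(dy > 0) - int(dy < 0))
    distance = abs(dx) + abs(dy)
    return direction, distance
```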

Defining a reward function

Given our newly defined state, the only thing we care about in our reward function is the distance. Since all we want is to move the agent towards mines, the distance is all that matters. Here are a few guesses at how the reward function could work:

  1. If the agent sweeps a mine (distance == 0), return a huge reward (e.g. 100).
  2. If the agent moves towards a mine (distance is shrinking), return a neutral (or small) reward (e.g. 0).
  3. If the agent moves away from a mine (distance is increasing), return a negative reward (e.g. -1).

Theoretically, since we penalize moving away from a mine, we don't even need rule 1 here.
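
In code, those rules could look roughly like this (the magnitudes are just the example values from the list above):

```python
def reward(prev_distance, new_distance):
    if new_distance == 0:             # swept a mine
        return 100
    if new_distance < prev_distance:  # moved towards the nearest mine
        return 0
    return -1                         # moved away from (or kept the same distance to) it
```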

Conclusion

The only thing left is determining a good learning rate and discount so that your agent performs well after 50 iterations. But, given the simplicity of the environment, this shouldn't even matter that much. Experiment.
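
For reference, the learning rate (alpha) and discount (gamma) plug into the standard tabular Q-learning update; a generic, untuned sketch with epsilon-greedy action selection could look like this (the values are arbitrary starting points, not recommendations):

```python
from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)                  # keyed by (state, action), defaults to 0

def choose_action(state):
    if random.random() < EPSILON:                     # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit

def update(state, action, r, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```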

Timo