
Please take a look at the picture below:

[image: the environment with fire holes]

My objective is for the agent to rotate and move around the environment without falling into the fire holes. I have thought of it like this:

Do for 1000 episodes:
    An episode:
        start traversing the environment;
        if the agent falls into a hole, go back to the first place!

I have read somewhere that the goal is the end point of an episode. So if we say the goal is to not fall into the fires, then the opposite of the goal (i.e. falling into a fire hole) would be the end point of an episode. What would you suggest for setting the goal?

Another question: why should I set up the reward matrix at all? I have read that Q-learning is model-free! I know that in Q-learning we set up the goal, not the way to achieve it (in contrast to supervised learning).


1 Answer

Lots of research has been directed to reward functions. Crafting a reward function to produce desired behavior can be non-intuitive. As Don Reba commented, simply staying still (as long as you don't begin in a fire state!) is an entirely reasonable approach for avoiding fire. But that's probably not what you want.

One way to spur activity (and not camp in a particular state) is to penalize the agent for each timestep experienced in a non-goal state. In this case, you might assign a -1 reward for each timestep spent in a non-goal state, and a zero reward for the goal state.
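
In code, that scheme is nothing more than a function of the state. A minimal sketch (how states and the goal are represented here is just an assumption for illustration):

```python
# -1 for every timestep spent in a non-goal state, 0 for the goal state.
def reward(state, goal_state):
    return 0 if state == goal_state else -1
```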

Why not a +1 for goal? You might code a solution that works with a +1 reward but consider this: if the goal state is +1, then the agent can compensate for any number of poor, non-optimal choices by simply parking in the goal state until the reward becomes positive.

A goal reward of zero forces the agent to find the quickest path to the goal (which I assume is desired). The only way to maximize reward (or minimize negative reward) is to find the goal as quickly as possible.

And the fire? Assign a reward of -100 (or -1,000, or -1,000,000; whatever suits your aims) for landing in fire. The combination of 0 for the goal, -1 for non-goal states, and -100 for fire should provide a reward function that yields the desired control policy.
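
To show how those numbers plug into the learning loop, here is a minimal tabular Q-learning sketch over a small gridworld. The grid layout, hyperparameters, and helper names are all illustrative assumptions, not something prescribed by your problem; the point is that the agent only ever sees the sampled reward at each step, which is also why no model of the environment's transitions is needed (that is what "model-free" refers to).

```python
import random
from collections import defaultdict

# Illustrative 4x4 grid: S = start, G = goal, F = fire, . = ordinary cell.
GRID = ["S...",
        ".F..",
        "..F.",
        "...G"]
ROWS, COLS = len(GRID), len(GRID[0])
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GOAL_REWARD, STEP_REWARD, FIRE_REWARD = 0, -1, -100   # the scheme described above
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1                 # assumed hyperparameters

Q = defaultdict(float)   # Q[(state, action)] -> estimated return, defaults to 0.0

def step(state, action):
    """Apply an action and return (next_state, reward, episode_done)."""
    (r, c), (dr, dc) = state, ACTIONS[action]
    nr = max(0, min(ROWS - 1, r + dr))
    nc = max(0, min(COLS - 1, c + dc))
    cell = GRID[nr][nc]
    if cell == "G":
        return (nr, nc), GOAL_REWARD, True
    if cell == "F":
        return (nr, nc), FIRE_REWARD, True
    return (nr, nc), STEP_REWARD, False

def choose_action(state):
    """Epsilon-greedy over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(1000):
    state, done = (0, 0), False          # every episode starts at the first place
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: only the sampled (s, a, r, s') is used;
        # no model of the environment's transition probabilities appears anywhere.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
```

Note that in this sketch falling into fire simply ends the episode, which keeps the -100 attached to the step that caused it; teleporting the agent back to the start instead (as in your pseudocode) can also work, it just makes episodes longer.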

Footnote: Google "negative bounded Markov Decision Processes (MDPs)" for more information on these reward functions and the policies they can produce.

Throwback1986