
Please take a look at the picture below:

[image: the environment with fire holes]

My objective is for the agent to rotate and move around the environment without falling into the fire holes. I have thought of it like this:

Do for 1000 episodes:
    An episode:
        start traversing the environment;
        if the agent falls into a hole, go back to the first place!

I have read somewhere that the goal is the end point of an episode. So if we say the goal is to not fall into the fires, then the opposite of the goal (i.e. falling into a fire hole) would be the end point of an episode. What would you suggest for setting the goal?

Another question: why should I set up the reward matrix at all? I have read that Q-learning is model-free! I know that in Q-learning we set up the goal, not the way to achieve it (in contrast to supervised learning).


1 Answer

Lots of research has been directed to reward functions. Crafting a reward function to produce desired behavior can be non-intuitive. As Don Reba commented, simply staying still (as long as you don't begin in a fire state!) is an entirely reasonable approach for avoiding fire. But that's probably not what you want.

One way to spur activity (and not camp in a particular state) is to penalize the agent for each timestep experienced in a non-goal state. In this case, you might assign a -1 reward for each timestep spent in a non-goal state, and a zero reward for the goal state.
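
In code, that scheme is nothing more than a function of the state. A minimal sketch (how states and the goal are represented here is just an assumption for illustration):

```python
# -1 for every timestep spent in a non-goal state, 0 for the goal state.
def reward(state, goal_state):
    return 0 if state == goal_state else -1
```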

Why not a +1 for goal? You might code a solution that works with a +1 reward but consider this: if the goal state is +1, then the agent can compensate for any number of poor, non-optimal choices by simply parking in the goal state until the reward becomes positive.

A goal reward of zero forces the agent to find the quickest path to the goal (which I assume is desired). The only way to maximize reward (or minimize negative reward) is to find the goal as quickly as possible.

And the fire? Assign a reward of -100 (or -1,000, or -1,000,000; whatever suits your aims) for landing in fire. The combination of 0 for the goal, -1 for non-goal states, and -100 for fire should provide a reward function that yields the desired control policy.
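
To show how those numbers plug into the learning loop, here is a minimal tabular Q-learning sketch over a small gridworld. The grid layout, hyperparameters, and helper names are all illustrative assumptions, not something prescribed by your problem; the point is that the agent only ever sees the sampled reward at each step, which is also why no model of the environment's transitions is needed (that is what "model-free" refers to).

```python
import random
from collections import defaultdict

# Illustrative 4x4 grid: S = start, G = goal, F = fire, . = ordinary cell.
GRID = ["S...",
        ".F..",
        "..F.",
        "...G"]
ROWS, COLS = len(GRID), len(GRID[0])
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

GOAL_REWARD, STEP_REWARD, FIRE_REWARD = 0, -1, -100   # the scheme described above
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1                 # assumed hyperparameters

Q = defaultdict(float)   # Q[(state, action)] -> estimated return, defaults to 0.0

def step(state, action):
    """Apply an action and return (next_state, reward, episode_done)."""
    (r, c), (dr, dc) = state, ACTIONS[action]
    nr = max(0, min(ROWS - 1, r + dr))
    nc = max(0, min(COLS - 1, c + dc))
    cell = GRID[nr][nc]
    if cell == "G":
        return (nr, nc), GOAL_REWARD, True
    if cell == "F":
        return (nr, nc), FIRE_REWARD, True
    return (nr, nc), STEP_REWARD, False

def choose_action(state):
    """Epsilon-greedy over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(1000):
    state, done = (0, 0), False          # every episode starts at the first place
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: only the sampled (s, a, r, s') is used;
        # no model of the environment's transition probabilities appears anywhere.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
```

Note that in this sketch falling into fire simply ends the episode, which keeps the -100 attached to the step that caused it; teleporting the agent back to the start instead (as in your pseudocode) can also work, it just makes episodes longer.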

Footnote: Google "negative bounded Markov Decision Processes (MDPs)" for more information on these reward functions and the policies they can produce.

Throwback1986