
Problem 1: We want to go from s to e. From each cell we can move right (R) or down (D). The layout of the environment is fully known, and the grid has 4 × 5 = 20 cells. The challenge is that we do not know the reward of each individual cell; we only receive an overall reward once we have completed a path. Example: one solution is RRDDRDR, whose overall reward is 16.

s 3 5 1 5

1 2 4 5 1

7 3 1 2 8

9 2 1 1 e

The goal is to find a sequence of actions from s to e that maximizes the overall reward. How can we distribute the overall reward among the actions?
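One possible way to make the question concrete is sketched below. It assumes the environment is deterministic, that s and e contribute 0 (which matches the total of 16 for RRDDRDR), and that all 35 paths can be tried. Each (cell, action) pair is credited with the best overall reward of any completed path that used it. This "best seen" rule is a simplification chosen for illustration, not a standard named algorithm, and the helper names (`episode_return`, `assign_credit`, `greedy_path`) are hypothetical.

```python
from itertools import combinations

# Hidden per-cell rewards from Problem 1 (s and e assumed to be 0).
# The learner never reads GRID directly; it only sees episode_return().
GRID = [
    [0, 3, 5, 1, 5],
    [1, 2, 4, 5, 1],
    [7, 3, 1, 2, 8],
    [9, 2, 1, 1, 0],
]
ROWS, COLS = 4, 5


def step(r, c, action):
    """Move one cell right ('R') or down ('D')."""
    return (r, c + 1) if action == "R" else (r + 1, c)


def episode_return(path):
    """Black-box environment: overall reward of a complete path."""
    r = c = total = 0
    for action in path:
        r, c = step(r, c, action)
        total += GRID[r][c]
    return total


def all_paths():
    """Every s-to-e path: 7 moves, 3 of which are 'D'."""
    n = (ROWS - 1) + (COLS - 1)
    for downs in combinations(range(n), ROWS - 1):
        yield "".join("D" if i in downs else "R" for i in range(n))


def assign_credit(paths):
    """Credit each (cell, action) with the best overall reward seen through it."""
    credit = {}
    for path in paths:
        g = episode_return(path)
        r = c = 0
        for action in path:
            key = ((r, c), action)
            credit[key] = max(credit.get(key, float("-inf")), g)
            r, c = step(r, c, action)
    return credit


def greedy_path(credit):
    """Follow the highest-credit action from s until e is reached."""
    r = c = 0
    path = ""
    while (r, c) != (ROWS - 1, COLS - 1):
        options = [a for a in "RD" if ((r, c), a) in credit]
        action = max(options, key=lambda a: credit[((r, c), a)])
        path += action
        r, c = step(r, c, action)
    return path


credit = assign_credit(all_paths())
best = greedy_path(credit)
print(best, episode_return(best))
```

On this grid the greedy path is RRDRDRD with an overall reward of 27. Taking the maximum works here only because rewards are static and deterministic; with noisy rewards, averaging the returns per (cell, action) pair, as in Monte Carlo methods, would be the more usual choice.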

Problem 2: This is the same as Problem 1, except that the rewards are dynamic: the way we reach a cell affects the rewards of the cells ahead. Example: the move sequences RRD and DRR both lead to the same cell, but because they follow different paths, the cells ahead will have different rewards.

s 3 5 1 5

1 2 4 9 -1

7 3 2 -5 18

9 2 9 7 e

(RRD path: choosing this path changes the rewards of the cells ahead)

s 3 5 1 5

1 2 4 3 1

7 3 30 7 -8

9 2 40 11 e

(DRR path: choosing this path changes the rewards of the cells ahead)

The goal is to find a sequence of actions from s to e that maximizes the overall reward. How can we distribute the overall reward among the actions once a path from s to e has been completed and the overall reward obtained?
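Because the rewards ahead depend on how a cell was reached, the cell alone no longer identifies the situation. One way to sketch this is to key the credit on the action history (the prefix of moves so far) instead of on the cell. The toy environment below is hypothetical: it assumes the first three moves select the grid, using the two grids above for RRD and DRR and the Problem 1 grid for every other prefix, with s and e counted as 0. The real rule for how rewards change is not given, so this is only an illustration.

```python
from itertools import combinations

# Hypothetical toy environment: the first three moves pick the reward grid.
STATIC = [[0, 3, 5, 1, 5], [1, 2, 4, 5, 1], [7, 3, 1, 2, 8], [9, 2, 1, 1, 0]]
AFTER_RRD = [[0, 3, 5, 1, 5], [1, 2, 4, 9, -1], [7, 3, 2, -5, 18], [9, 2, 9, 7, 0]]
AFTER_DRR = [[0, 3, 5, 1, 5], [1, 2, 4, 3, 1], [7, 3, 30, 7, -8], [9, 2, 40, 11, 0]]
GRIDS = {"RRD": AFTER_RRD, "DRR": AFTER_DRR}
N_MOVES, N_DOWN = 7, 3


def episode_return(path):
    """Black-box environment: overall reward of a complete path."""
    grid = GRIDS.get(path[:3], STATIC)  # assumption: other prefixes stay static
    r = c = total = 0
    for action in path:
        if action == "R":
            c += 1
        else:
            r += 1
        total += grid[r][c]
    return total


def all_paths():
    """Every s-to-e path: 7 moves, 3 of which are 'D'."""
    for downs in combinations(range(N_MOVES), N_DOWN):
        yield "".join("D" if i in downs else "R" for i in range(N_MOVES))


def assign_credit_by_history(paths):
    """Credit each (prefix, action) with the best overall reward seen after it."""
    credit = {}
    for path in paths:
        g = episode_return(path)
        for i, action in enumerate(path):
            key = (path[:i], action)
            credit[key] = max(credit.get(key, float("-inf")), g)
    return credit


def greedy_path(credit):
    """Extend the prefix with the highest-credit action until the path is complete."""
    prefix = ""
    while len(prefix) < N_MOVES:
        options = [a for a in "RD" if (prefix, a) in credit]
        prefix += max(options, key=lambda a: credit[(prefix, a)])
    return prefix


credit = assign_credit_by_history(all_paths())
best = greedy_path(credit)
print(best, episode_return(best))
```

Under these toy assumptions the greedy path is DRRDDRR with an overall reward of 88, whereas the best path starting with RRD reaches only 38. The cost of keying on the full history is that the table grows with the number of prefixes, so a larger problem would need some compression of the history into a smaller state.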

1 Answer


Can you say more about the research you are doing? (The problem sounds a lot like the sort of thing someone might assign just to get you thinking about temporal credit assignment.)

  • This is more of a comment than an answer. – JamesS Sep 18 '19 at 12:54
  • Ah, I agree. I'm new here. Can I change the status of my "answer" then? – Michael L. Littman Sep 18 '19 at 14:58
  • To be more precise, I am working on genetic programming and want to build a well-structured tree via reinforcement learning. A tree consists of nodes (states) and inputs (actions), and the fitness of the tree is my overall reward, or gain. I want to build such a tree by distributing the overall reward among the actions and states, so as to find the best action (function, variable, or terminal) for each state (node). – Mohammad Abdollahi Sep 18 '19 at 15:09