
Good morning, I'm facing an RL problem with many constraints. The main idea is that my agent will control many different machines, for example ordering them to go out to do their missions (we don't care about the mission itself), or ordering them to return to the depot and choosing the right place for each of them to park (depending on the constraints). The catch is that the agent takes decisions at predefined periods of time, and for each period we know which actions (go out, go in) are allowed. For example, at 8 o'clock it may decide to send 4 machines out, and at 14 o'clock it may decide to bring 2 machines back (choosing the right place for them).

In the literature I saw many ideas that refer to BDQ, but is it required for my problem? I'm thinking about having actions like [chooseMachine1, chooseMachine2, chooseMachine3, ..., chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4], and specifying in the code the logic that, depending on the current period, I first expose a number M <= N of machines to choose from (giving 0 probability to the actions that aren't possible at the moment; at 14 o'clock, only the machines that are out are concerned by the agent's decision). If the agent chooses Machine1, it then only has access to the actions that are possible for that machine.
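One common way to implement this "0 probability for impossible actions" idea on top of a plain DQN is action masking at selection time: set the Q-values of invalid actions to -inf before the argmax. Here is a minimal sketch; the action names, sizes, and the mask layout are illustrative assumptions based on your description, not a fixed recipe:

```python
import numpy as np
import torch

# Hypothetical action layout: N machine-selection actions,
# then goOut, then 4 goInPlace actions.
N_MACHINES = 5
ACTIONS = [f"chooseMachine{i+1}" for i in range(N_MACHINES)] + \
          ["goOut", "goInPlace1", "goInPlace2", "goInPlace3", "goInPlace4"]

def masked_greedy_action(q_values: torch.Tensor, valid_mask: np.ndarray) -> int:
    """Pick the greedy action among valid ones by setting invalid Q-values to -inf."""
    q = q_values.clone()
    q[~torch.as_tensor(valid_mask)] = -float("inf")
    return int(torch.argmax(q).item())

# Example: at 14 o'clock only machines that are out may be chosen, so the
# environment would build a mask exposing only those machine indices.
q = torch.randn(len(ACTIONS))
mask = np.zeros(len(ACTIONS), dtype=bool)
mask[[1, 3]] = True  # suppose only machines 2 and 4 are out
action = masked_greedy_action(q, mask)
```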

So, my question is: do you think my ideas are right? (I'm a beginner.) My idea is to build a DQN with this logic for possible/impossible actions. Do you think a BDQ is a better fit for my problem, i.e. having N branches for N machines, where each branch has the same possible actions (branch1 (Machine1): goOut, goPlace1, goPlace2, ...)? If so, are there any implementation examples?
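For reference, BDQ comes from Tavakoli et al., "Action Branching Architectures for Deep Reinforcement Learning". A minimal sketch of the branching head you describe could look like the following; all dimensions and the hidden size are placeholders, and the dueling-style aggregation follows that paper:

```python
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    """BDQ-style head: one shared trunk, one Q-branch per machine.
    Each branch outputs Q-values over the same per-machine sub-actions
    (goOut, goPlace1, ..., goPlaceP)."""
    def __init__(self, obs_dim: int, n_machines: int, n_sub_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)  # shared state value
        self.branches = nn.ModuleList(
            [nn.Linear(hidden, n_sub_actions) for _ in range(n_machines)]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        v = self.value(h)                                    # (B, 1)
        adv = torch.stack([b(h) for b in self.branches], 1)  # (B, N, A)
        # Per-branch dueling aggregation: Q = V + (A - mean(A))
        return v.unsqueeze(-1) + adv - adv.mean(dim=-1, keepdim=True)
```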

If you have resources to recommend, I'd be glad to check them out.

Thank You

koussix

1 Answer


What would an agent navigating a maze do if the chosen action would run it into a wall?

I think the usual approach in RL is to allow the move and then handle the result in the environment. That way the environment can simply make nothing happen, or even give a negative reward, when an action is "disallowed".

At training convergence the agent will hopefully have learned not to choose ineffective actions.
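For example, a minimal sketch of such an environment step in a gym-style interface (the helper names _valid_actions, _apply, _observation, and the -1 penalty are placeholders, not a fixed recipe):

```python
def step(self, action):
    if action not in self._valid_actions():
        # Invalid move: nothing happens; a small negative reward discourages it.
        return self._observation(), -1.0, False, {"invalid": True}
    self._apply(action)  # actually move a machine out / into a place
    return self._observation(), self._reward(), self._done(), {}
```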

marco romelli
  • So you think that, for example, I can let the agent decide to take the action "goPlace1" even if no machine is chosen? I saw in another topic that it can help the agent converge more quickly, but you are probably right. And do you think that a DQN can be sufficient? Just by updating the env when the agent chooses a machine (maybe with a parameter like the id of the machine chosen at time 'T' => 0 if no choice, and then it can take the decision of placing the machine)? – koussix Jul 07 '22 at 15:48
  • Yes. One way to simplify could be to merge the "choose" and "go" steps into single actions, e.g. [machine1_goout, machine1_goto1, machine1_goto2, ...] (sketched after this thread). – marco romelli Jul 08 '22 at 09:39
  • If I do that I will have a very large number of actions... I just mentioned an example, but in fact I can have more than 30 machines and maybe 40 places, so 40*30, plus the go out and in... it will be a large action space, and I don't know if that is recommended. Another thing I don't understand is how to implement the logic of taking, for example, only 3 actions per period: is it like checking in the env whether a 4th action is taken and applying a penalty, or just adding a 'doNothing' action? – koussix Jul 08 '22 at 13:28
  • The action space for chess is something like 4672 actions, so I don't think your case is a problem. If an agent tries to perform more than 3 actions per period, the env should just do nothing, but I'm not sure I understand your concept of a period. It's up to you to decide when the agent takes actions, so I don't see why this would be a problem. – marco romelli Jul 08 '22 at 15:34
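A minimal sketch of the combined action space discussed in the comments above, with a doNothing action and a per-period validity mask. All names, sizes, and the morning/afternoon period logic are illustrative assumptions; the 3-actions-per-period limit would additionally be tracked by the environment itself:

```python
import numpy as np

# One flat action per (machine, target), plus a doNothing action.
N_MACHINES, N_PLACES = 30, 40
ACTIONS = ["doNothing"]
ACTIONS += [f"machine{m}_goOut" for m in range(N_MACHINES)]
ACTIONS += [f"machine{m}_goPlace{p}"
            for m in range(N_MACHINES) for p in range(N_PLACES)]
# 1 + 30 + 30*40 = 1231 actions: large, but well below e.g. chess (~4672).

def period_mask(machines_out: set, period: str) -> np.ndarray:
    """Valid actions at a decision point: in the morning, only 'go out' for
    machines in the depot; in the afternoon, only 'go to place' for machines
    that are currently out. doNothing is always allowed."""
    mask = np.zeros(len(ACTIONS), dtype=bool)
    mask[0] = True
    for m in range(N_MACHINES):
        if period == "morning" and m not in machines_out:
            mask[1 + m] = True
        elif period == "afternoon" and m in machines_out:
            base = 1 + N_MACHINES + m * N_PLACES
            mask[base: base + N_PLACES] = True
    return mask
```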