Multiagent (not deep) reinforcement learning? Modeling the problem

Question

I have N agents (users) sharing a single wireless channel. In each time slot, only one agent can successfully access the channel and receive a reward.

Each user has a buffer that can store B packets; I assume the buffer is infinite.

Each user n observes from the environment whether the transmission in time slot t was a success or a failure (collision). If more than one user accesses the channel in the same slot, they get a penalty.

This feedback from the channel is the same for all users, since there is only one channel. The reward is -B_n (the negative of the number of packets in user n's buffer). Each user wants to maximize its own reward, i.e. it tries to empty its buffer.

Packets arrive at each user according to a Poisson process with an average of $\lambda$ packets per time slot.

Each user keeps a history of the previous 10 time slots and uses it as the input to a DQN, which outputs the probability of taking action A_n: stay silent or transmit. Each history entry is a tuple (A_n, F, B_n), where F is the channel feedback.

Each user is unaware of the action and buffer status of other users.
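For concreteness, here is a minimal sketch of the setup described above. The class name `SlottedChannelEnv`, the extra `IDLE` feedback value for slots in which nobody transmits, the `collision_penalty` parameter, and the rule that a user with an empty buffer cannot transmit are all assumptions made for illustration; they are not part of the original setup.

```python
import numpy as np

# Channel feedback shared by all users. IDLE is an added assumption.
IDLE, SUCCESS, COLLISION = 0, 1, 2


class SlottedChannelEnv:
    """N users share one channel; a slot succeeds only if exactly one transmits."""

    def __init__(self, n_users, arrival_rate, collision_penalty=0.0, seed=0):
        self.n_users = n_users
        self.arrival_rate = arrival_rate            # lambda, packets per slot
        self.collision_penalty = collision_penalty  # hypothetical extra penalty
        self.rng = np.random.default_rng(seed)
        self.buffers = np.zeros(n_users, dtype=int)  # B_n, unbounded

    def step(self, actions):
        """actions[n] is 0 (stay silent) or 1 (transmit)."""
        actions = np.asarray(actions)
        # Assumption: a user with an empty buffer has nothing to send.
        transmitting = (actions == 1) & (self.buffers > 0)
        n_tx = int(transmitting.sum())
        if n_tx == 0:
            feedback = IDLE
        elif n_tx == 1:
            feedback = SUCCESS
            self.buffers[transmitting] -= 1
        else:
            feedback = COLLISION
        # Poisson arrivals with mean lambda at every user.
        self.buffers += self.rng.poisson(self.arrival_rate, self.n_users)
        # Reward is -B_n; colliding users optionally pay an extra penalty.
        rewards = -self.buffers.astype(float)
        if feedback == COLLISION:
            rewards[transmitting] -= self.collision_penalty
        return feedback, rewards
```

Each user would then build its own 10-slot history from its action, the shared feedback, and its own buffer length, without ever seeing `env.buffers` of the others.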

I am trying to model my problem with multiagent reinforcement learning. So far I have tried DQN, but the results are more or less the same as a random access scheme. Could it be that the users don't have enough contextual information to learn the behaviour of the other users? Or could there be another reason?

I would like to know how I can model my environment, since the state (in the RL sense) is static: the environment doesn't change. The only thing that changes is each user's history at each time slot. So I am not sure whether it is a partially observable MDP, or whether it should be modelled as a multiagent single-arm bandit problem, and I don't know if the latter is correct.

My second concern is that DQN has not worked, and I would like to know whether tabular Q-learning can be used for such a problem. I have not seen multiagent works in which anyone has used Q-learning. Any insights would be helpful.
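To make the tabular option concrete, here is a rough sketch of independent Q-learning, where every user keeps its own table and treats the others as part of the environment. The state used here (own buffer length capped at `buffer_cap`, plus the last channel feedback), the idle feedback value, and all hyperparameters are assumptions chosen only to keep the table small; they are not taken from my DQN setup. Because every user keeps learning, the environment looks non-stationary from each user's point of view, so nothing here guarantees convergence.

```python
import numpy as np


def run_independent_q_learning(n_users=2, arrival_rate=0.2, n_slots=20000,
                               buffer_cap=5, alpha=0.1, gamma=0.9,
                               epsilon=0.1, seed=0):
    """Each user learns its own Q-table from local observations only."""
    rng = np.random.default_rng(seed)
    n_feedback = 3  # 0 = idle (assumed), 1 = success, 2 = collision
    # q[user, capped buffer, last feedback, action]; action 1 = transmit.
    q = np.zeros((n_users, buffer_cap + 1, n_feedback, 2))
    users = np.arange(n_users)
    buffers = np.zeros(n_users, dtype=int)
    feedback = 0
    successes = 0
    for _ in range(n_slots):
        b = np.minimum(buffers, buffer_cap)
        # Epsilon-greedy action for every user, from its own table.
        greedy = q[users, b, feedback].argmax(axis=1)
        explore = rng.random(n_users) < epsilon
        actions = np.where(explore, rng.integers(0, 2, n_users), greedy)
        # Assumption: users with empty buffers cannot transmit.
        transmitting = (actions == 1) & (buffers > 0)
        n_tx = int(transmitting.sum())
        if n_tx == 1:
            buffers[transmitting] -= 1
            next_feedback = 1
            successes += 1
        else:
            next_feedback = 0 if n_tx == 0 else 2
        buffers += rng.poisson(arrival_rate, n_users)
        rewards = -buffers.astype(float)  # reward is -B_n
        nb = np.minimum(buffers, buffer_cap)
        # Standard one-step Q-learning update, applied per user.
        target = rewards + gamma * q[users, nb, next_feedback].max(axis=1)
        current = q[users, b, feedback, actions]
        q[users, b, feedback, actions] = current + alpha * (target - current)
        feedback = next_feedback
    return q, successes / n_slots


if __name__ == "__main__":
    q_tables, throughput = run_independent_q_learning()
    print("fraction of successful slots:", throughput)
```

Comparing the returned fraction of successful slots against a random-access baseline run on the same arrival rate would show whether the tables learn anything beyond chance.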

score 0 · Answer 1 · answered Nov 25 '20 at 19:34

Your problem can be modeled as a Decentralized POMDP (Dec-POMDP; see an overview here).

To summarize this approach: you consider a multi-agent system in which each agent models its own policy, and you then try to build a joint policy from these individual ones. Of course, the complexity grows as the number of agents, states and actions increases, so there are several approaches, mainly based on heuristics, that prune branches of the joint policy tree that are not "good" in comparison with others. A very well-known example of this approach is precisely packet routing, where it is possible to define discrete action and state spaces.

But be aware that even for tiny systems, the complexity often becomes intractable!
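To give a feel for that growth, here is a small counting sketch. It assumes deterministic finite-horizon policy trees, where each agent picks one action at every node and every node branches once per possible observation; the function names are made up for illustration.

```python
def policy_tree_nodes(n_obs, horizon):
    """Number of decision nodes in one agent's policy tree."""
    return sum(n_obs ** t for t in range(horizon))


def joint_policy_count(n_agents, n_actions, n_obs, horizon):
    """Number of deterministic joint policies for the given horizon."""
    per_agent = n_actions ** policy_tree_nodes(n_obs, horizon)
    return per_agent ** n_agents


if __name__ == "__main__":
    # 2 agents, 2 actions (silent / transmit), 2 observations each.
    print(joint_policy_count(2, 2, 2, horizon=3))  # 16384
    print(joint_policy_count(2, 2, 2, horizon=4))  # 1073741824
```

Going from horizon 3 to horizon 4 with only two agents already takes the count from about sixteen thousand to over a billion, which is why exhaustive search is out of the question and pruning heuristics are needed.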
