I have N number of agents/users accessing a single wireless channel and at each time, only one agent can access the channel and receive a reward.
Each user has a buffer that can store B number of packets and I assume it as infinite buffer.
Each user n gets observation from the environment if the packet in time slot t was successful or failure (collision). If more than one users access the channel, they get penalty.
This feedback from the channel is same for all the users since we only have one channel. The reward is - B_n (negative of the number of packets in buffer). Each user wants to maximize its own reward and try to empty the buffer.
Packets arrive at each users following a poisson process with average $\lambda$ packets per time slot.
Each user has a history of previous 10 time slots that it uses as an input to the DQN to output the probability of taking action A_n: stay silent or transmit. The history is (A_n, F, B_n)
Each user is unaware of the action and buffer status of other users.
I am trying to model my problem with multiagent reinforcement learning and so far I have tried it with DQN but results are more or less like a random scheme. It could be that users don't have much contextual information in order to learn the behaviour of other users? Or can there be any other reason?
I would like to know how can I model my environment since the state (in RL sense) is static, the environment doesn't changes. The only thing that changes is each users history at each time slot. So I am not sure if its a partially observable MDP or should it be modelled as multiagent single-arm bandit problem which I don't know is correct or not.
The second concern is that I have tried DQN but it has not worked and I would like to know if such problem can be used with tabular Q-learning? I have not seen multiagent works in which anyone has used QL. Any insights might be helpful.