
I want to model the service of selling seats on an airplane as an MDP (Markov decision process) in order to use reinforcement learning for airline revenue optimization. For that I needed to define the states, actions, policy, value and reward. I have thought about it a little, but I think there is still something missing.

I model my system this way (a rough code sketch follows the list):

  • States = (r, c), where r is the number of passengers and c is the number of seats bought, so r >= c.
  • Actions = (p1, p2, p3), the three possible prices. The objective is to decide which of them gives the most revenue.
  • Reward: revenue.
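
A rough sketch in code of what I have in mind (the prices here are only placeholder values):

```python
from collections import namedtuple

# State (r, c): r passengers seen so far, c seats bought so far, with r >= c.
State = namedtuple("State", ["r", "c"])

# Actions: the three candidate prices (placeholder values).
ACTIONS = (p1, p2, p3) = (50.0, 100.0, 150.0)

def reward(price, bought):
    """Immediate revenue: the chosen price if the seat is bought, otherwise 0."""
    return price if bought else 0.0

s0 = State(r=0, c=0)  # start of an episode: no passengers yet, no seats sold
```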

Could you please tell me what you think and help me?

After the modeling, I have to implement all of that with reinforcement learning. Is there a package that does the work?


2 Answers


I think the biggest thing missing in your formulation is the sequential part. Reinforcement learning is useful in sequential settings, where the next state depends on the current state and the chosen action (hence the "Markovian" part). In this formulation you have not specified any Markovian behavior at all. Also, the reward is a scalar that depends on either the current state or the combination of current state and action. In your case the revenue depends on the price (the action), but it has no relation to the state (the seats). Those are the two big problems I see with your formulation; there are others as well. I would suggest going through some RL theory (online courses and such) and working a few sample problems before trying to formulate your own.
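
To make that concrete, here is a toy sketch of a Markovian step, where the next state and the reward both depend on the current state and the chosen action (the demand model and numbers are purely illustrative, not your actual problem):

```python
import random

def step(state, price, capacity=100):
    """One Markovian transition for the seat-selling problem (toy numbers)."""
    r, c = state                                  # passengers arrived, seats sold so far
    buy_prob = max(0.0, 1.0 - price / 200.0)      # assumed: higher price -> lower chance of a sale
    bought = c < capacity and random.random() < buy_prob
    next_state = (r + 1, c + 1 if bought else c)  # next state depends on current state and action
    reward = price if bought else 0.0             # reward depends on the action and the outcome
    return next_state, reward
```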

shunyo
  • @shunyo, thank you for your response. Actually, I have looked at a few RL examples, for instance the grid-world example. I saw that in the grid, each state is modeled as I did, with a tuple (r, c). They use epsilon-greedy to choose an action, and once that is done, they use probabilities P(s'|s,a) to get to the next state. I wanted to do the same, so I chose 3 actions and then implemented a function (def get_next_state(r, c, action, df_experiment):) that calculates my probability of selling the seat or not, and then moves to the next state. What do you think? – fatima-ezzahra elaamraoui Jun 08 '18 at 07:09
  • The revenue is calculated in Python with a function that adds the immediate reward to the old reward; the immediate reward depends on the action taken and on whether the seat is purchased or not, which itself depends on the purchasing probability calculated by a model for the customer who arrived: def revenue(bought, action, r, r_total). So, with these details, do you still think the sequential part is missing? Otherwise, do you think that adding time to my states could give them a sequential part? Thank you. – fatima-ezzahra elaamraoui Jun 08 '18 at 07:11
  • I can give more details if you want. – fatima-ezzahra elaamraoui Jun 08 '18 at 07:39

Adding this here for anyone stumbling across this topic and looking for an answer:

The sequential part should be different time steps (e.g. days or hours) at which a given pricing action is applied. The reward is the revenue achieved in that time step (price × quantity), and the future rewards depend on the number of seats remaining unsold and the prices at which they can potentially be sold (a rough code sketch of this setup follows the list below).

State: current number of seats remaining unsold and passengers looking to purchase

Actions: potential seat prices, with probabilities of different numbers of seats being sold at different prices (transition probabilities)

Rewards: revenue from seats sold in the current state
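
A rough environment sketch of this formulation (the demand model, seat counts and prices below are assumptions for illustration only):

```python
import numpy as np

class SeatPricingEnv:
    """One episode = one flight's selling horizon, split into discrete time steps."""

    def __init__(self, total_seats=100, horizon=30, prices=(50, 100, 150), seed=0):
        self.total_seats = total_seats
        self.horizon = horizon            # number of pricing decisions (e.g. days before departure)
        self.prices = prices
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.t = 0
        self.seats_left = self.total_seats
        return (self.t, self.seats_left)  # state: time step and seats remaining unsold

    def step(self, action):
        price = self.prices[action]
        # Assumed demand model: Poisson arrivals whose mean shrinks as the price grows.
        demand = self.rng.poisson(300.0 / price)
        sold = min(demand, self.seats_left)
        reward = price * sold             # revenue achieved in this time step
        self.seats_left -= sold
        self.t += 1
        done = self.t >= self.horizon or self.seats_left == 0
        return (self.t, self.seats_left), reward, done
```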

In terms of then optimising this, the Bellman equation is a common approach.
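
For example, here is a small backward-induction sketch of the Bellman equation for this problem, where V[t, s] is the optimal expected revenue-to-go with s seats left at time step t (the demand distribution is an assumed placeholder):

```python
import math
import numpy as np

PRICES = (50, 100, 150)
SEATS = 10            # small numbers so the value table stays readable
HORIZON = 5           # number of pricing time steps
MAX_DEMAND = 5

def demand_pmf(price):
    """Assumed demand model: Poisson with mean 300/price, tail mass lumped into MAX_DEMAND."""
    lam = 300.0 / price
    pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(MAX_DEMAND + 1)]
    pmf[-1] += 1.0 - sum(pmf)
    return pmf

# Bellman backward induction: V[t, s] = max over prices of E[price*sold + V[t+1, s-sold]].
V = np.zeros((HORIZON + 1, SEATS + 1))
for t in range(HORIZON - 1, -1, -1):
    for s in range(SEATS + 1):
        best = 0.0
        for price in PRICES:
            expected = 0.0
            for d, p in enumerate(demand_pmf(price)):
                sold = min(d, s)
                expected += p * (price * sold + V[t + 1, s - sold])
            best = max(best, expected)
        V[t, s] = best

print("Expected optimal revenue starting with all seats unsold:", V[0, SEATS])
```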