I am working on a sequential decision-making problem where a battery controller, given the renewable generation in each state, should follow an optimal policy that minimizes a global objective ( the cost of power purchased from the grid ). The supply = demand balance is:
P_grid = P_house - P_solar + P_battery
( At each step I know P_house and P_solar, and have to choose an action P_battery. )
States = ( P_house, P_solar, Energy ), where Energy is the battery's stored energy.
Actions ( P_battery ) are discrete and can be positive ( charging ) or negative ( discharging ).
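For concreteness, here is a minimal sketch of the transition I have in mind ( the constants CAPACITY and DT are illustrative placeholders, not from my real setup ):

```python
import numpy as np

# Illustrative constants, not from my actual setup
CAPACITY = 10.0  # maximum stored energy (kWh)
DT = 1.0         # length of one time step (h)

def step(p_house, p_solar, energy, p_battery):
    """One transition of the supply = demand balance.

    p_battery > 0 charges the battery (extra power drawn),
    p_battery < 0 discharges it (offsets the load).
    """
    # The stored energy must stay in [0, CAPACITY]; this is the
    # constraint that limits which actions are actually feasible.
    p_battery = np.clip(p_battery, -energy / DT, (CAPACITY - energy) / DT)

    # Whatever is not covered locally is purchased from the grid.
    p_grid = p_house - p_solar + p_battery

    # Stored energy carried into the next state.
    energy_next = energy + p_battery * DT
    return p_grid, energy_next
```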
Intuitively, given the constant Cost, the reward function should be -P_grid * Cost, so that overall there is less dependence on the grid. However, in the tabular Q-learning case, my agent converges to an optimal policy which makes P_grid = 0 ( almost ). This is somewhat drastic, since the state variable Energy is limited at each time step and in turn limits my feasible actions ( P_battery ).
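Roughly, the reward and tabular update I am using look like this ( ALPHA, GAMMA, and COST are placeholder values ):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99  # illustrative learning rate and discount
COST = 0.2                # assumed constant grid price

# Tabular Q-values over discretized (P_house, P_solar, Energy) states
# and discrete P_battery actions.
Q = defaultdict(float)

def reward(p_grid):
    # Pay COST for every unit of power purchased from the grid.
    return -p_grid * COST

def q_update(state, action, r, next_state, actions):
    # Standard tabular Q-learning update.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```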
How should I define my reward function to minimize the global objective AND ensure proper utilization of the battery energy?