I am working on a sequential decision-making problem where a battery controller, given the renewable generation in each state, should follow an optimal policy that minimizes a global objective ( the cost of power purchased from the grid ). The supply = demand balance is:
P_grid = P_house - P_solar + P_battery
( At each step I know P_house and P_solar, and have to choose an action P_battery. )
States = ( P_house, P_solar, Energy ), where Energy is the battery's stored energy.
Actions ( P_battery ) are discrete and can be positive ( charging ) or negative ( discharging ).
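For concreteness, here is a minimal sketch of the transition I have in mind ( the constants CAPACITY and DT are illustrative placeholders, not from my real setup ):

```python
import numpy as np

# Illustrative constants, not from my actual setup
CAPACITY = 10.0  # maximum stored energy (kWh)
DT = 1.0         # length of one time step (h)

def step(p_house, p_solar, energy, p_battery):
    """One transition of the supply = demand balance.

    p_battery > 0 charges the battery (extra power drawn),
    p_battery < 0 discharges it (offsets the load).
    """
    # The stored energy must stay in [0, CAPACITY]; this is the
    # constraint that limits which actions are actually feasible.
    p_battery = np.clip(p_battery, -energy / DT, (CAPACITY - energy) / DT)

    # Whatever is not covered locally is purchased from the grid.
    p_grid = p_house - p_solar + p_battery

    # Stored energy carried into the next state.
    energy_next = energy + p_battery * DT
    return p_grid, energy_next
```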
Intuitively, given the constant Cost, the reward function should be -P_grid * Cost, so that overall there is less dependence on the grid. However, in the tabular Q-learning case, my agent converges to an optimal policy which makes P_grid = 0 ( almost ). This is somewhat drastic, since the state variable Energy is limited at each time step and in turn limits my feasible actions ( P_battery ).
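Roughly, the reward and tabular update I am using look like this ( ALPHA, GAMMA, and COST are placeholder values ):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99  # illustrative learning rate and discount
COST = 0.2                # assumed constant grid price

# Tabular Q-values over discretized (P_house, P_solar, Energy) states
# and discrete P_battery actions.
Q = defaultdict(float)

def reward(p_grid):
    # Pay COST for every unit of power purchased from the grid.
    return -p_grid * COST

def q_update(state, action, r, next_state, actions):
    # Standard tabular Q-learning update.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
```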
How should I define my reward function to minimize the global objective AND ensure proper utilization of the battery energy?