I am working on the power management of a device using a Q-learning algorithm. The device has two power modes: idle and sleep. When the device is asleep, incoming processing requests are buffered in a queue. The Q-learning algorithm tries to minimize a cost function, which is a weighted sum of the immediate power consumption and the latency caused by an action:
c(s,a) = lambda * p_avg + (1 - lambda) * avg_latency
In each state, the learning algorithm takes an action by executing one time-out value from a pool of pre-defined time-out values, and then evaluates the effect of that action in the next state using the formula above. The parameter lambda in the equation is a power-performance trade-off parameter (0 <= lambda <= 1). It defines whether the algorithm should favor power saving (lambda --> 1) or minimize latency (lambda --> 0). The latency of each request is calculated as queuing time + execution time.
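For reference, here is a rough sketch (in Python) of how I compute the cost; the function and variable names are just illustrative, not the exact ones from my code:

```python
# Rough sketch of the cost calculation (names are illustrative).
def request_latency(queuing_time, execution_time):
    # Latency of a single request: time spent waiting in the queue
    # plus the time needed to process it.
    return queuing_time + execution_time

def cost(p_avg, avg_latency, lam):
    # Weighted sum of average power and average latency over the
    # interval covered by the chosen time-out action.
    # lam -> 1 favors power saving, lam -> 0 favors low latency.
    return lam * p_avg + (1.0 - lam) * avg_latency
```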
The problem is that the learning algorithm always favors small time-out values in the sleep state. This is because the average latency for small time-out values is always lower, and hence their cost is also smaller. When I change the value of lambda from low to high, I see no effect on the final output policy: it always selects small time-out values as the best actions in each state. Instead of the average power and average latency for each state, I have tried using the overall average power consumption and overall average latency to calculate the cost of a state-action pair, but it doesn't help. I have also tried using the total energy consumption and the total latency experienced by all requests to calculate the cost in each state-action pair, but that doesn't help either. My question is: what would be a better cost function for this scenario? I update the Q-value as follows:
Q(s,a) = Q(s,a) + alpha * [c(s,a) + gamma * min_{a'} Q(s',a') - Q(s,a)]
where alpha is the learning rate (decayed slowly) and gamma = 0.9 is the discount factor.
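For completeness, this is roughly how I apply that update in code (the time-out pool and state encoding shown here are placeholders, not my actual values):

```python
from collections import defaultdict

TIMEOUTS = [1, 5, 10, 50, 100]   # placeholder pool of pre-defined time-out values
GAMMA = 0.9                      # discount factor

Q = defaultdict(float)           # Q[(state, action)] -> estimated cost-to-go

def update_q(state, action, c, next_state, alpha):
    # Cost-minimizing Q-learning step:
    # Q(s,a) <- Q(s,a) + alpha * [c(s,a) + gamma * min_a' Q(s',a') - Q(s,a)]
    best_next = min(Q[(next_state, a)] for a in TIMEOUTS)
    Q[(state, action)] += alpha * (c + GAMMA * best_next - Q[(state, action)])
```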