
I am working on the power management of a device using a Q-learning algorithm. The device has two power modes, idle and sleep. When the device is asleep, requests for processing are buffered in a queue. The Q-learning algorithm tries to minimize a cost function that is a weighted sum of the immediate power consumption and the latency caused by an action.

c(s, a) = lambda * p_avg + (1 - lambda) * avg_latency

In each state, the learning algorithm takes an action (executing a time-out value from a pool of pre-defined time-out values) and evaluates the effect of that action in the next state using the formula above. The parameter lambda in the equation is a power-performance trade-off parameter (0 <= lambda <= 1). It defines whether the algorithm should favor power saving (lambda --> 1) or latency minimization (lambda --> 0). The latency for each request is calculated as queuing time + execution time.
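
To make the setup concrete, here is a minimal sketch of this cost in Python (function and variable names are only illustrative; p_avg and avg_latency are assumed to be measured over the interval covered by the executed time-out):

```python
def cost(p_avg, avg_latency, lam):
    """Weighted power/latency cost.

    lam in [0, 1]: lam --> 1 emphasizes power saving,
    lam --> 0 emphasizes low latency.
    """
    return lam * p_avg + (1.0 - lam) * avg_latency
```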
The problem is that the learning algorithm always favors small time-out values in the sleep state. This is because the average latency for small time-out values is always lower, and hence their cost is also smaller. When I change the value of lambda from low to high, I see no effect on the final policy: it always selects small time-out values as the best actions in each state. Instead of the average power and average latency for each state, I have tried using the overall average power consumption and overall average latency to calculate the cost for a state-action pair, but it doesn't help. I have also tried using the total energy consumption and the total latency experienced by all requests, but that doesn't help either. My question is: what would be a better cost function for this scenario? I update the Q-value as follows:

Q(s, a) = Q(s, a) + alpha * [c(s, a) + gamma * min_a' Q(s', a') - Q(s, a)]

where alpha is the learning rate (decreased slowly) and gamma = 0.9 is the discount factor.
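
In code, the update looks roughly like the following sketch (the state encoding, the action pool, and the epsilon-greedy exploration are simplified placeholders, not my exact implementation):

```python
import random
from collections import defaultdict

Q = defaultdict(float)           # Q[(state, action)] -> estimated cost-to-go
timeouts = [1, 5, 10, 50, 100]   # illustrative pool of pre-defined time-out values
gamma = 0.9                      # discount factor

def update_q(state, action, c, next_state, alpha):
    """One 1-step Q-learning update toward c(s, a) + gamma * min_a' Q(s', a')."""
    best_next = min(Q[(next_state, a)] for a in timeouts)
    Q[(state, action)] += alpha * (c + gamma * best_next - Q[(state, action)])

def choose_timeout(state, epsilon=0.1):
    """Epsilon-greedy selection over the time-out pool (minimizing cost)."""
    if random.random() < epsilon:
        return random.choice(timeouts)
    return min(timeouts, key=lambda a: Q[(state, a)])
```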

user846400
  • Just as a note on convention: avoid the use of the term "lambda" in your reward function. Q(lambda) is another algorithm, which, not coincidentally, might be more appropriate for this problem (assuming you are using 1-step Q-learning). – Throwback1986 Aug 08 '12 at 17:52
  • Yes, it's 1-step learning. lambda in the cost function is just a power-performance trade-off parameter. If I omit this parameter in the cost function, how do I define a criterion for either power saving or latency minimization? – user846400 Aug 09 '12 at 08:23
  • I'm not suggesting that you omit the parameter. I'm suggesting you choose another variable to denote your "power performance parameter". – Throwback1986 Aug 09 '12 at 14:54
  • My confusion is that if I use 1-step Q-learning (as above), shall I use the entire power consumption and entire latency for all the requests to calculate the cost in each state (s, a)? Or shall I use the immediate power consumption and average latency caused by an action _a_ in state _s_? – user846400 Aug 09 '12 at 18:54

1 Answer


To answer the questions posed in the comments:

shall I use the entire power consumption and entire latency for all the requests to calculate the cost in each state (s,a)?

No. In Q-learning, reward is generally considered an instantaneous signal associated with a single state-action pair. Take a look at Sutton and Barto's page on rewards. As shown there, the instantaneous reward (r_{t+1}) is subscripted by time step, indicating that it is indeed instantaneous. Note that R_t, the expected return, already accounts for the accumulation of rewards over time. Thus, there is no need for you to explicitly keep track of accumulated latency and power consumption (and doing so is likely to be counter-productive).

or shall I use the immediate power consumption and average latency caused by an action a in state s?

Yes. To underscore the statement above, see the definition of an MDP on page 4 here. The relevant bit:

The reward function specifies expected instantaneous reward as a function of the current state and action

As I indicated in a comment above, problems in which reward is being "lost" or "washed out" might be better solved with a Q(lambda) implementation because temporal credit assignment is performed more effectively. Take a look at Sutton and Barto's chapter on TD(lambda) methods here. You can also find some good examples and implementations here.
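
For reference, here is a rough sketch of Watkins's Q(lambda) with accumulating eligibility traces, written in the cost-minimization convention used in the question (the state/action encoding and the trace-decay constant are assumptions, not taken from your code):

```python
from collections import defaultdict

Q = defaultdict(float)          # Q[(s, a)] -> estimated cost-to-go
E = defaultdict(float)          # eligibility traces e(s, a)
gamma, trace_decay = 0.9, 0.8   # trace_decay plays the role of "lambda" in Q(lambda)

def q_lambda_step(s, a, c, s_next, actions, alpha, next_action_is_greedy):
    """One Watkins's Q(lambda) update after observing cost c for (s, a).

    next_action_is_greedy: whether the action already chosen for s_next is
    the greedy (minimum-cost) one; traces are cut after exploratory actions.
    """
    a_star = min(actions, key=lambda b: Q[(s_next, b)])   # greedy action in s_next
    delta = c + gamma * Q[(s_next, a_star)] - Q[(s, a)]
    E[(s, a)] += 1.0                                      # accumulating trace
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] = gamma * trace_decay * E[key] if next_action_is_greedy else 0.0
```

The traces let a single observed cost update every recently visited state-action pair, which is what improves temporal credit assignment here.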

Throwback1986
  • Thanks Throwback1986. Sutton and Barto talk about _maximizing_ the reward throughout their book. My confusion is: do all the algorithms hold in the same way for minimizing a cost instead of maximizing a reward? In my case, I have to minimize the cost function above under a power-performance constraint. – user846400 Aug 13 '12 at 10:11
  • Minimizing negative reward is the same as maximizing positive reward. It's just a matter of sign convention. If you express your cost function as a negative reinforcement, your agent can be guided to select actions that minimize cost (thereby maximizing reward). – Throwback1986 Aug 13 '12 at 14:20
  • Thanks Throwback1986. Your hints are very useful. One more question: if an action leaves the system in the same state, shall I assign a cost to this state, or shall I wait until the state changes and assign the cost in terms of total energy consumption and latency? For example, if the algorithm executes a time-out period in the sleep state and no requests come by the end of the time-out period, the system finds itself in the same state. – user846400 Aug 13 '12 at 14:26
  • Note the time subscript on the reward function: this should tell you that a new reward (or cost) is to be computed at each time step. Recall that reward is computed as r(s, a), i.e. a function of state and action. There is no inherent restriction that forces s_{t-1} to be different from s_t. – Throwback1986 Aug 13 '12 at 14:46
  • Thanks. In my case, could you suggest a better cost function to deal with the power consumption and per request latency with respect to a selected power-performance parameter? – user846400 Aug 14 '12 at 11:28
  • You might consider moving to an episodic task. For example, allow the agent to train for a randomly-selected number of steps (ensure that the minimum number of steps is large enough to accomplish something!). Then assign reward using your function based on accumulated power consumption and latency. Note: this will likely take a large number of episodes to learn a policy, but it seems consistent with your goal (as I understand it). Again, Q(lambda) is recommended. Also consider using exploring starts to ensure adequate state-space exploration. – Throwback1986 Aug 16 '12 at 13:53
  • For example, say I randomly choose 100 steps as an episode. After 100 steps, which state-action pairs should be assigned the cost? All of them? Does that mean I have to keep track of the accumulated power + latency for all the state-action pairs and then assign the reward to all of them after the episode? – user846400 Aug 17 '12 at 17:20
  • BTW, I am now using Q(lambda) learning with eligibility traces as described in Sutton's book, and I do see some improvements. The learned policy now goes from higher power consumption (low latency) to low power consumption (high latency) as I vary lambda in the above cost function. The problem is that the learning achieves this by just increasing the time-out values in the sleep state while keeping the time-out values in the idle state at the minimum. What I expect is that it should increase the time-outs in sleep and decrease the time-outs in idle (from largest to smallest) as I vary lambda from 0 to 1. – user846400 Aug 17 '12 at 17:48
  • To the above comment: take a look at the Mountain Car problem. I think this is a reasonable model, and the application of the reward function is well-covered. – Throwback1986 Aug 17 '12 at 19:44