Alpha and Gamma parameters in QLearning

Question

What difference does a big or small gamma value make to the algorithm? As I see it, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q-values seem to get very close to zero really quickly (I'm seeing values on the order of 10^-300 in just a quick test). How do people usually plot Q-values given that problem (I'm plotting (x, y, best Q-value for that state))? I'm trying to get around it with logarithms, but even then it feels kind of awkward.

Also, I don't get the reason for having an alpha parameter in the Q-learning update function. It basically sets the magnitude of the update we make to the Q-value function. My understanding is that it is usually decreased over time. What is the point of decreasing it over time? Should an update at the beginning have more importance than one 1000 episodes later?

Also, I was thinking that a good way to explore the state space, whenever the agent doesn't take the greedy action, would be to go to any state that still has a zero Q-value (which means, at least most of the time, a state never visited before), but I don't see that mentioned anywhere in the literature. Are there any downsides to this? I know this can't be used with (at least some) generalization functions.

Another idea would be to keep a table of visited state/action pairs and try the actions that have been tried the fewest times in that state. Of course, this can only be done in relatively small state spaces (in my case it is definitely possible).

A third idea, for late in the exploration process, would be to look not only at the Q-values of the actions available in the current state, but also at the actions available in the states those actions lead to, then at the ones after those, and so on.

I know those questions are kinda unrelated but I'd like to hear the opinions of people that have worked before with this and (probably) struggled with some of them too.

What was the policy? What is the problem? What are the states? What motivates the work? What code did you use? Did you use a reference problem to show your code works? — EngrStudent, Apr 27 '17 at 16:58

user1949902 · Answer 1 · 2013-08-31T00:43:36.447

From a reinforcement learning master's candidate:

Alpha is the learning rate. If the reward or transition function is stochastic (random), then alpha should change over time, approaching zero at infinity. This has to do with approximating the expected value of an inner product (T (transition) * R (reward)) when one of the two, or both, behave randomly.

That fact is important to note.

Gamma is the value of future reward. It can affect learning quite a bit, and can be a dynamic or static value. If it is equal to one, the agent values future reward JUST AS MUCH as current reward. This means that doing something good ten actions from now is JUST AS VALUABLE as doing it right away. So learning doesn't work that well at high gamma values.

Conversely, a gamma of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions.
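
For reference, here is a minimal sketch of the standard tabular Q-learning update, just to show where alpha and gamma enter. The function name q_update, the dictionary-based table, and the toy states and actions are assumptions made for illustration, not anything taken from the question.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Best value the agent currently believes it can get from the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # gamma scales how much that future value counts against the immediate reward.
    td_target = reward + gamma * best_next
    # alpha scales how far the old estimate moves toward the target.
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Hypothetical two-action example: unseen entries default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 10.0
new_value = q_update(Q, "s0", "right", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
```

With gamma=0 the target collapses to the immediate reward, and with alpha=1 the old estimate is thrown away entirely on each update.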

Also - as for exploration behavior... there is actually TONS of literature on this. All of your ideas have, 100%, been tried. I would recommend a more detailed search, and to even start googling Decision Theory and "Policy Improvement".
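
As one example, the visit-count idea from the question can be sketched as a small variation on epsilon-greedy. This is only an illustration under assumed names (choose_action, a counts table keyed by state/action pairs), and it only makes sense for state spaces small enough to keep in a table.

```python
import random
from collections import defaultdict


def choose_action(Q, counts, state, actions, epsilon, rng=random):
    """Epsilon-greedy, except exploratory steps pick the least-tried action."""
    if rng.random() < epsilon:
        # Explore: prefer actions tried the fewest times in this state.
        fewest = min(counts[(state, a)] for a in actions)
        candidates = [a for a in actions if counts[(state, a)] == fewest]
        action = rng.choice(candidates)
    else:
        # Exploit: take the action with the highest current Q-value.
        action = max(actions, key=lambda a: Q[(state, a)])
    counts[(state, action)] += 1
    return action
```

Both tables can be plain defaultdicts (float for Q, int for counts), so untried pairs start at zero.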

Just adding a note on alpha: imagine you have a reward function that spits out 1 or 0 for a certain state-action combo SA. Every time you execute SA, you get 1 or 0. If you keep alpha at 1, the Q-value is simply overwritten by the latest reward, so it is always 1 or 0. If alpha is fixed at 0.5, the Q-value moves halfway toward each new reward and keeps oscillating forever; for rewards received as 1, 0, 1, 0, ... it ends up bouncing between roughly 2/3 and 1/3. However, if you shrink alpha over time, for example alpha = 1/n on the n-th execution of SA, the Q-values for the same reward sequence are 1, 0.5, 0.67, 0.5, 0.6, 0.5, ..., which is just the running average of the rewards. At infinity it will be 0.5, which is the expected reward in a probabilistic sense. (Halving alpha on every step, by contrast, shrinks it so fast that the estimate stalls near 0.56 instead of reaching 0.5.)
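
A quick way to check this is to run the update on the alternating reward sequence with different alpha schedules. The sketch below assumes a single state-action pair with no next-state term (so the update is just a step toward the reward); the names run and alpha_for_step are made up for the example.

```python
def run(rewards, alpha_for_step):
    """Track one Q-value updated toward each reward with a step-dependent alpha."""
    q = 0.0
    history = []
    for n, r in enumerate(rewards, start=1):
        q += alpha_for_step(n) * (r - q)
        history.append(q)
    return history


rewards = [1.0, 0.0] * 500  # 1, 0, 1, 0, ...

constant = run(rewards, lambda n: 0.5)      # fixed learning rate
decaying = run(rewards, lambda n: 1.0 / n)  # running-average schedule

print(constant[-2:])  # last two values under a constant alpha
print(decaying[-2:])  # last two values under alpha = 1/n
```

With alpha = 1/n the update is exactly the incremental formula for the sample mean, which is why it settles on the expected reward, while the constant schedule never stops reacting to the most recent sample.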

Are all the facts you described about the choice of alpha valid for both Q-learning and Deep Q-learning (and its variants)? — AleB, Mar 11 '20 at 14:15

score 2 · Answer 2 · answered Oct 25 '17 at 12:19

What difference to the algorithm does it make having a big or small gamma value?

Gamma should correspond to the size of the observation space: use larger gammas (i.e. closer to 1) for big state spaces, and smaller gammas for smaller spaces.

One way to think about gamma is that it represents the decay rate of a reward from the final, successful state.
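
To put rough numbers on that decay, here is a small sketch. It assumes a single reward of 1 at the goal, zero reward everywhere else, and a deterministic path, so that the converged value k steps from the goal is simply gamma^k; discounted_value is a hypothetical helper name.

```python
def discounted_value(reward, gamma, steps_to_goal):
    """Value of a single terminal reward seen from steps_to_goal steps away."""
    return reward * gamma ** steps_to_goal


for gamma in (0.5, 0.9, 0.99):
    values = [discounted_value(1.0, gamma, k) for k in (1, 10, 100, 1000)]
    print(gamma, values)
```

Under these assumptions a small gamma on a long path produces values far below anything that is easy to plot, which may be related to the tiny magnitudes mentioned in the question. Plotting the logarithm of the value then gives something linear in the distance to the goal.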

score -3 · Answer 3 · answered Dec 07 '09 at 21:27

I haven't worked with systems exactly like this before, so I don't know how useful I can be, but...

Gamma is a measure of the agent's tendency to look forward to future rewards. The smaller it is, the more the agent will tend to take the action with the greatest immediate reward, regardless of the resulting state. Agents with larger gamma will learn long paths to big rewards. As for all Q-values approaching zero, have you tried a very simple state map (say, one state and two actions) with gamma=0? That should quickly approach Q=reward.
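
That sanity check could look something like the sketch below. The single state, the two action names, the fixed rewards, and the function train_single_state are all made up for the test; the point is only that with gamma=0 each Q-value should settle on its own reward.

```python
def train_single_state(rewards, alpha=0.1, gamma=0.0, sweeps=200):
    """Q-learning on one state whose actions all lead back to that same state."""
    q = {action: 0.0 for action in rewards}
    for _ in range(sweeps):
        for action, reward in rewards.items():
            best_next = max(q.values())  # the next state is the same state
            q[action] += alpha * (reward + gamma * best_next - q[action])
    return q


q = train_single_state({"a": 1.0, "b": 5.0})
print(q)  # each entry should be very close to its reward
```

If the Q-values in a setup like this still drift toward zero, the problem is more likely in the update code than in the choice of gamma.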

The idea of reducing alpha is to damp down oscillations in the Q values, so that the agent can settle into a stable pattern after a wild youth.

Exploring the state space? Why not just iterate over it, have the agent try everything? There's no reason to have the agent actually follow a course of action in its learning-- unless that's the point of your simulation. If the idea is just to find the optimal behavior pattern, adjust all Q's, not just the highest ones along a path.

The point of doing Q-learning is not to iterate over the whole space. It's precisely to learn as fast as possible (i.e., given giant state spaces, learning quickly how to explore them well enough for a given task). If the idea were to iterate over it, then I'd use a typical search algorithm (breadth-first, depth-first, etc.). Also, I don't get the point of setting gamma to zero. Only the actions that lead directly to the goal will be updated. All the others will stay equal to zero. — devoured elysium, Dec 09 '09 at 05:28
