
I understand the epsilon-greedy algorithm, but I have two points of confusion.

  1. Does it keep track of the average reward or of the value? It is usually explained in the context of the multi-armed bandit, but in that problem there is no distinction between reward and value.
  2. Is the epsilon-greedy algorithm a subset of Q-learning? The loose definition of Q-learning seems to be: approximating the optimal Q-function from past experience.
AgnosticCucumber

1 Answer


Epsilon-greedy is a policy, not an algorithm. It applies only to problems with discrete actions: in state s, you select the action a according to

a = argmax_a Q(s,a)             with probability 1 - epsilon
a = a uniformly random action   with probability epsilon

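You can pair it with any method that estimates Q(s,a) over discrete actions, such as Q-learning or SARSA. Continuous-action methods such as DDPG usually explore by adding noise to the action instead.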
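As a rough illustration of the rule above, here is a minimal Python sketch. The function name `epsilon_greedy` and its parameters are made up for this example, and it assumes the Q-values for the current state are already available as a plain list, however they were learned.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q(s, a) estimates for one state.

    q_values: list of estimated values, one per discrete action (assumed input).
    epsilon:  probability of exploring instead of exploiting.
    rng:      any object with random() and randrange(), e.g. random.Random(0).
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimate (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0)
```

Note that the function only reads the estimates and never updates them, which is why it can sit on top of whichever learning rule produces Q(s,a).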

Simon