I understand the epsilon-greedy algorithm, but I have two points of confusion.
- Does it keep track of the average reward or of the value? It is usually explained in the context of the multi-armed bandit problem, but in that problem there seems to be no distinction between reward and value.
- Is the epsilon-greedy algorithm a special case of Q-learning? The loose definition of Q-learning I have seen is: approximating the optimal Q-function from past experience.
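
For concreteness, here is a minimal sketch of the epsilon-greedy bandit I have in mind. The function name `epsilon_greedy_bandit`, the Gaussian rewards with unit variance, and the default parameters are my own assumptions for illustration, not taken from any particular textbook or library.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards (assumed)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample average for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(estimates, counts)
```

Here `estimates[a]` is the sample-average reward of arm `a`, which is the quantity I refer to as "average reward" in the first question.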