
I am taking a course in RL, and many times, learning policy parameters or value-function weights essentially boils down to using Stochastic Gradient Descent (SGD). The agent is represented as having a sequence of states S_t, actions A_t, and rewards R_t received at time t of the sequence.

My understanding of SGD in general, e.g. when applied to neural networks on training datasets, is that we assume the data in the mini-batches to be i.i.d. This makes sense because we are, in a way, "approximating" an expectation with an average of gradients over points that are assumed to be drawn independently from the same distribution. So why is it that we can use SGD in RL while incrementing through time, where successive samples are clearly correlated? Is that due to the implicit conditional-independence assumption encoded in the transition distribution p(S_t | S_{t-1}), i.e. the Markov property?
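
To make the contrast concrete, here is a minimal sketch of what I have in mind (my own illustrative code, not from the course; the names `sgd_minibatch_step`, `td0_step`, `alpha`, and `gamma` are my assumptions): a standard SGD step that averages gradients over an i.i.d. mini-batch, versus an incremental semi-gradient TD(0) update applied at each time step of a trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised setting: average gradient over an i.i.d. mini-batch ---
def sgd_minibatch_step(w, X, y, alpha=0.01):
    """One SGD step for linear regression on a mini-batch assumed i.i.d."""
    preds = X @ w
    grad = X.T @ (preds - y) / len(y)  # average gradient approximates the expectation
    return w - alpha * grad

# --- RL setting: one update per time step along a trajectory ---
def td0_step(w, phi_t, r_t, phi_next, alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient TD(0) step for a linear value function v(s) = w . phi(s)."""
    v_t = w @ phi_t
    v_next = 0.0 if done else w @ phi_next
    td_error = r_t + gamma * v_next - v_t
    return w + alpha * td_error * phi_t  # gradient of v_t w.r.t. w is phi_t

# Purely illustrative usage with random data:
w_sup = sgd_minibatch_step(np.zeros(4), rng.normal(size=(32, 4)), rng.normal(size=32))
w_rl = td0_step(np.zeros(4), rng.normal(size=4), 1.0, rng.normal(size=4))
```

In the second function, consecutive calls use S_t and S_{t+1} from the same trajectory, so the samples feeding the updates are not independent, which is exactly what I am asking about.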

Thanks for clarifying this point. Amine

  • Try asking your question on stats.stackexchange.com, rather than here. – Peter O. Dec 28 '20 at 20:24
  • @PeterO. No, Stats SE is not the most appropriate site to ask RL questions. The most appropriate site to ask RL questions is [Artificial Intelligence SE](https://ai.stackexchange.com/). We have a bunch of RL enthusiasts at AI SE (including myself). We are the site with the highest number of RL questions across all SE sites (apart from SO, but SO existed way before AI SE and has a bigger community), so we have (a lot) more RL questions than Stats SE, Data Science SE and CS SE, for some reason. – nbro Dec 29 '20 at 19:14
  • Hi Amine. As stated above, I suggest that you ask this question on Artificial Intelligence SE, then delete it from here (Stack Overflow) and Stats SE, to avoid cross-posting, which is discouraged. – nbro Dec 29 '20 at 19:17
  • Thanks guys. I already did that. I will close this thread shortly. Thanks – Amine Dec 29 '20 at 19:30

0 Answers