Questions tagged [sarsa]

SARSA (State-Action-Reward-State-Action) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning.

Algorithm

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

A SARSA agent interacts with the environment and updates the policy based on the actions it actually takes, which makes it an on-policy learning algorithm. As the update rule above shows, the Q value for a state-action pair is corrected by an error term, scaled by the learning rate alpha. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state-action observation.

Watkins's Q-learning was created as an alternative to the existing temporal-difference techniques; it updates the policy based on the maximum reward over the available actions. The difference may be summarized as follows: SARSA learns the Q values associated with the policy it follows itself, while Watkins's Q-learning learns the Q values associated with the exploitation (greedy) policy while following an exploration/exploitation policy. For further information on the exploration/exploitation trade-off, see the reinforcement-learning tag.
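A minimal tabular sketch of this update loop, in Python (the gym-style env interface with reset()/step() and the epsilon-greedy helper are assumptions for illustration, not part of the tag wiki):

    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon, rng):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    def sarsa(env, n_states, n_actions, episodes=500,
              alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular SARSA: on-policy TD(0) control.
        # env is assumed gym-style: reset() -> state,
        # step(action) -> (next_state, reward, done).
        rng = np.random.default_rng(0)
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, n_actions, epsilon, rng)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = epsilon_greedy(Q, s_next, n_actions, epsilon, rng)
                # On-policy target: bootstrap from the action actually chosen next.
                target = r + gamma * Q[s_next, a_next] * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s, a = s_next, a_next
        return Q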

Some optimizations of Watkins's Q-learning can also be applied to SARSA; for example, the paper "Fast Online Q(λ)" (Wiering and Schmidhuber, 1998) describes the small changes needed for SARSA(λ) implementations as they arise.
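For concreteness, the core of a SARSA(λ) step with accumulating eligibility traces looks roughly like this (a sketch under the same assumptions as above, with E a trace matrix of the same shape as Q and lam the trace-decay parameter; this is illustrative, not the paper's implementation):

    # After observing (s, a, r, s_next, a_next) within an episode:
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    E[s, a] += 1.0            # accumulating trace for the visited pair
    Q += alpha * delta * E    # all recently visited pairs share the TD error
    E *= gamma * lam          # traces decay by gamma * lambda every step
    # E is typically reset to zeros at the start of each episode.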

33 questions
146
votes
8 answers

What is the difference between Q-learning and SARSA?

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms. According to the book Reinforcement Learning: An Introduction (by Sutton and…
Ælex
  • 14,432
  • 20
  • 88
  • 129
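The gist of the distinction, in code: the two methods share the same update form and differ only in the bootstrap target (a sketch; Q is a table indexed by state and action, as in the wiki example above):

    # SARSA (on-policy): bootstrap from the action actually taken next.
    sarsa_target = r + gamma * Q[s_next, a_next]
    # Q-learning (off-policy): bootstrap from the best available action.
    q_target = r + gamma * np.max(Q[s_next])
    # Identical update form for both:
    Q[s, a] += alpha * (sarsa_target - Q[s, a])   # or q_target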
17
votes
1 answer

Eligibility trace reinitialization between episodes in SARSA-Lambda implementation

I'm looking at this SARSA-Lambda implementation (i.e. SARSA with eligibility traces) and there's a detail which I still don't get. (Image from http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node77.html) So I understand that all Q(s,a) are updated…
MrD
  • 4,986
  • 11
  • 48
  • 90
11
votes
3 answers

Are Q-learning and SARSA with greedy selection equivalent?

The difference between Q-learning and SARSA is that Q-learning compares the current state and the best possible next state, whereas SARSA compares the current state against the actual next state. If a greedy selection policy is used, that is, the…
Mouscellaneous
  • 2,584
  • 3
  • 27
  • 37
6
votes
1 answer

Episodic Semi-gradient Sarsa with Neural Network

While trying to implement the Episodic Semi-gradient Sarsa with a Neural Network as the approximator I wondered how I choose the optimal action based on the currently learned weights of the network. If the action space is discrete I can just…
zimmerrol
  • 4,872
  • 3
  • 22
  • 41
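One common answer for a discrete action space is to have the network output one Q value per action, evaluate it once for the current state, and take the argmax (a sketch; q_network here is a hypothetical callable returning a vector of per-action values):

    import numpy as np

    def select_action(q_network, state, n_actions, epsilon, rng):
        # Epsilon-greedy over the network's per-action Q estimates.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))   # explore
        q_values = q_network(state)               # assumed shape: (n_actions,)
        return int(np.argmax(q_values))           # exploit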
5
votes
1 answer

Why is there no n-step Q-learning algorithm in Sutton's RL book?

I think I am messing something up. I always thought that:
  • 1-step TD on-policy = Sarsa
  • 1-step TD off-policy = Q-learning
Thus I conclude:
  • n-step TD on-policy = n-step Sarsa
  • n-step TD off-policy = n-step Q-learning
In Sutton's book,…
siva
  • 1,183
  • 3
  • 12
  • 28
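For reference, the n-step Sarsa target in Sutton & Barto bootstraps n steps ahead, reducing to the one-step rule in the tag wiki when n = 1:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [G_{t:t+n} - Q(S_t, A_t)]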
4
votes
1 answer

Understanding linear, gradient-descent Sarsa (based on Sutton & Barto)

I'm trying to implement linear gradient-descent Sarsa based on Sutton & Barto's Book, see the algorithm in the picture below. However, I struggle to understand something in the algorithm: Is the dimension of w and z independent of how many…
bbiegel
  • 207
  • 2
  • 8
3
votes
1 answer

SARSA value approximation for Cart Pole

I have a question on this SARSA FA. In input cell 142 I see this modified update w += alpha * (reward - discount * q_hat_next) * q_hat_grad where q_hat_next is Q(S', a') and q_hat_grad is the derivative of Q(S, a) (assume S, a, R, S' a'…
Chuk Lee
  • 3,570
  • 22
  • 19
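For comparison, the episodic semi-gradient Sarsa weight update in Sutton & Barto uses the full TD error rather than the reward minus the discounted next value (a sketch; q_hat_current, q_hat_next, and q_hat_grad are assumed to be the approximator's value at (S, a), its value at (S', a'), and its gradient at (S, a)):

    # Non-terminal step: bootstrap from the next state-action value.
    td_error = reward + discount * q_hat_next - q_hat_current
    w += alpha * td_error * q_hat_grad
    # Terminal step: the bootstrap term is dropped.
    # w += alpha * (reward - q_hat_current) * q_hat_grad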
3
votes
1 answer

Sarsa algorithm, why Q-values tend to zero?

I'm trying to implement the Sarsa algorithm for solving the Frozen Lake environment from OpenAI gym. I've only recently started working on this, but I think I understand it. I also understand how the Sarsa algorithm works; there are many sites where to find a…
CSR95
  • 121
  • 8
3
votes
1 answer

Deep Neural Network combined with Q-learning

I'm using joint positions from a Kinect camera as my state space but I think it's going to be too large (25 joints x 30 per second) to just feed into SARSA or Q-learning. Right now I'm using the Kinect Gesture Builder program which uses Supervised…
3
votes
0 answers

Implementing Eligibility Traces in SARSA

I am writing a MATLAB implementation of the SARSA algorithm, and have successfully written a one-step implementation. I am now trying to extend it to use eligibility traces, but the results I obtain are worse than with one-step. (i.e. the algorithm…
MrD
  • 4,986
  • 11
  • 48
  • 90
3
votes
1 answer

SARSA Implementation

I am learning about SARSA algorithm implementation and had a question. I understand that the general "learning" step takes the form of: Robot (r) is in state s. There are four actions available: North (n), East (e), West (w) and South (s) such…
MrD
  • 4,986
  • 11
  • 48
  • 90
2
votes
0 answers

Implementing Sarsa(lambda) - Gridworld - in Julia language

Could you explain to me what is wrong in this code? I am trying to implement SARSA(lambda) with eligibility traces. using ReinforcementLearningBase, GridWorlds using PyPlot world = GridWorlds.GridRoomsDirectedModule.GridRoomsDirected(); env =…
przel123
  • 21
  • 1
2
votes
0 answers

Is this true? What about Expected SARSA and Double Q-learning?

I'm studying reinforcement learning and I'm facing a problem understanding the difference between SARSA, Q-learning, Expected SARSA, Double Q-learning and temporal difference. Can you please explain the difference and tell me when to use each? And…
2
votes
1 answer

Sarsa and Q-learning (reinforcement learning) don't converge to the optimal policy

I have a question about my own project for testing a reinforcement learning technique. First let me explain the purpose. I have an agent which can take 4 actions during 8 steps. At the end of these eight steps, the agent can be in 5 possible…
T.L
  • 21
  • 4
2
votes
1 answer

Sarsa with neural network to solve the Mountain Car Task

I am trying to implement the Episodic Semi-gradient Sarsa for Estimating q described in Sutton's book to solve the Mountain Car Task. To approximate q I want to use a neural network. Therefore, I came up with this code. But sadly my agent is not…
zimmerrol
  • 4,872
  • 3
  • 22
  • 41