Ideally, you want to learn the true Q-function, i.e., the one that satisfies the Bellman equation
Q(s,a) = R(s,a) + gamma * E_{a'~pi}[Q(s',a')]   for all (s,a)
where the expectation over a' is taken w.r.t. the policy.
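As a concrete illustration, here is a minimal sketch of the right-hand side for a single transition, assuming a small tabular problem where `pi_probs[s]` holds the policy's action probabilities in state s (the names `Q`, `pi_probs`, `bellman_target` are hypothetical):

```python
import numpy as np

def bellman_target(r, s_next, Q, pi_probs, gamma=0.99):
    """Right-hand side of the Bellman equation for one transition:
    R(s,a) + gamma * E_{a'~pi}[Q(s',a')]."""
    # Expectation over a' under the policy's action probabilities in s'
    expected_q = np.dot(pi_probs[s_next], Q[s_next])
    return r + gamma * expected_q
```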
First, we approximate the problem and get rid of the "forall", because we have access only to a few samples (especially with continuous actions, where the "forall" results in infinitely many constraints). Second, say you want to learn a deterministic policy (if an optimal policy exists, then a deterministic optimal policy exists). Then the expectation disappears, but you still need to collect samples somehow. This is where the "behavior" policy comes in, which is usually just a noisy version of the policy you want to optimize: the most common choices are ε-greedy for discrete actions, or adding Gaussian noise if the action is continuous (a sketch of both is shown below).
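For example, the two standard behavior policies could look like this (a minimal sketch; `Q` is assumed to be a tabular NumPy array and `pi` a deterministic target policy, both hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """Discrete actions: follow the greedy (target) policy with prob. 1-eps,
    otherwise pick a random action."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit: argmax_a Q(s,a)

def gaussian_behavior(pi, s, sigma=0.1):
    """Continuous actions: deterministic target policy plus Gaussian noise."""
    a = pi(s)
    return a + sigma * rng.normal(size=np.shape(a))
```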
So now you have samples collected from a behavior policy, and a deterministic target policy that you want to optimize.
The resulting equation is
Q(s,a) = R(s,a) + gamma*Q(s',pi(s'))
The difference between the two sides is the TD error, and you want to minimize it (in the mean-squared sense):
min E[(R(s,a) + gamma*Q(s',pi(s')) - Q(s,a))^2]
where the expectation is approximated with samples (s,a,s') collected using the behavior policy.
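A minimal sketch of that minimization, assuming a linear Q-function Q(s,a) = w·phi(s,a) with a hypothetical feature map `phi`, a target policy `pi`, and a batch of (s, a, r, s') tuples from the behavior policy (the bootstrap term is treated as a constant, as is commonly done):

```python
import numpy as np

def td_step(w, batch, phi, pi, gamma=0.99, lr=1e-2):
    """One semi-gradient step on the mean squared TD error
    E[(R(s,a) + gamma*Q(s',pi(s')) - Q(s,a))^2]."""
    grad = np.zeros_like(w)
    for s, a, r, s_next in batch:
        q_sa = w @ phi(s, a)
        target = r + gamma * (w @ phi(s_next, pi(s_next)))  # no gradient through the target
        td_error = target - q_sa
        grad += -td_error * phi(s, a)
    w -= lr * grad / len(batch)
    return w
```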
If we consider Soroush's pseudocode, then with discrete actions pi(s') = argmax_{a'} Q(s',a'), so the bootstrap term becomes gamma * max_{a'} Q(s',a'), and the update rule follows the (semi-)gradient of the TD(0) error.
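In the tabular, discrete-action case this is just the familiar Q-learning update (a sketch, assuming `Q` is a NumPy array of shape (n_states, n_actions)):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0) update with the greedy target policy: pi(s') = argmax_a' Q(s',a')."""
    td_target = r + gamma * Q[s_next].max()   # bootstrap with the greedy action in s'
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error               # move Q(s,a) toward the target
    return Q
```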
These are some good easy reads to learn more about TD: 1, 2, 3, 4.
EDIT
Just to underline the difference between on- and off-policy: SARSA is on-policy, because the TD error it minimizes is built as
min E[(R(s,a) + gamma*Q(s',a') - Q(s,a))^2]
where a' is the action actually collected while sampling data with the behavior policy, and not pi(s') (the action that the target policy would choose in state s').
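For comparison with the Q-learning update above, a SARSA update bootstraps with the a' sampled by the behavior policy (again a sketch, with `Q` as a NumPy array):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD(0): bootstrap with the action a' taken by the behavior policy."""
    td_target = r + gamma * Q[s_next, a_next]   # a_next comes from the behavior policy
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```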