I am reading Sutton and Barto and want to check my understanding.
For off-policy learning, can we think of a robot's policy for walking on one terrain, say sand, as the target policy, and the robot's policy for walking on snow as the behaviour policy? That is, are we using our experience of walking on snow to approximate the optimal policy for walking on sand?
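
To make the terms concrete, here is a minimal sketch of how I currently read the off-policy setup, using ordinary importance sampling on a made-up one-step problem. Everything in it (the `REWARDS`, `TARGET`, and `BEHAVIOR` tables and the function name `off_policy_estimate`) is hypothetical and only for illustration. Note that in this sketch both policies act in the same environment and differ only in their action probabilities, which may not match the sand/snow picture above.

```python
import random

# Hypothetical one-step problem: two actions with fixed rewards.
REWARDS = [1.0, 0.0]
TARGET = [0.9, 0.1]    # pi(a): the policy we want to evaluate
BEHAVIOR = [0.5, 0.5]  # b(a): the policy that actually generates the data


def off_policy_estimate(n_episodes, seed=0):
    """Estimate the value of TARGET from data sampled under BEHAVIOR."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        # Act according to the behaviour policy.
        action = rng.choices([0, 1], weights=BEHAVIOR)[0]
        # Importance sampling ratio pi(a) / b(a) reweights the return.
        rho = TARGET[action] / BEHAVIOR[action]
        total += rho * REWARDS[action]
    return total / n_episodes


estimate = off_policy_estimate(20000)
# Exact value of the target policy, for comparison.
true_value = sum(p * r for p, r in zip(TARGET, REWARDS))
print(estimate, true_value)
```

The estimate should land close to the target policy's true value even though every action was chosen by the behaviour policy.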