
I am new to Python, or any programming language for that matter. For months now I have been working on stabilising the inverted pendulum. I have gotten everything working, but I am struggling to get the right reward function. So far, after research and trial and error, the best I could come up with is

R=(x_dot**2)+0.001*(x**2)+0.1*(theta**2)

But I don't reach stability, meaning theta=0 held for long enough.

Does anyone have an idea of the logic behind the ideal reward function?
Thank you.

Stevy KUIMI
    Is this the pendulum or the cart-pole? I see `x` and I assume it is the x-coordinate of the cart, but your title says just pendulum. Also, is this a penalty cost? Because you usually want to penalize for high velocity / acceleration to have smooth trajectories. – Simon Jul 25 '18 at 14:02
  • Yes Simon, this is the cart-pole problem and, yes, I want to establish a penalty cost. – Stevy KUIMI Aug 03 '18 at 18:14

2 Answers


For just the balancing problem (not the swing-up), even a binary reward is enough. Something like

  • Always 0, then -1 when the pole falls. Or,
  • Always 1, then 0 when the pole falls.

Which one to use depends on the algorithm, the discount factor, and the episode horizon. Either way, the task is easy and both will do the job.
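
For example, a minimal sketch of the second variant (the angle threshold is an illustrative assumption, roughly the usual cart-pole termination condition):

import math

def binary_reward(theta, theta_threshold=math.radians(12)):
    # 1 for every step the pole is still within the allowed angle,
    # 0 once it has fallen (the episode usually terminates there)
    return 1.0 if abs(theta) < theta_threshold else 0.0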

For the swing-up task (harder than just balancing, as the pole starts upside down and you need to swing it up by moving the cart), it is better to have a reward that depends on the state. Usually the simple cos(theta) is fine. You can also add a penalty on the angular velocity and on the action, in order to prefer slow-changing, smooth trajectories, and a penalty if the cart goes out of the bounds of the x coordinate.
A reward including all these terms would look like this

reward = cos(theta) - 0.001*theta_d**2 - 0.0001*action**2 - 100*out_of_bound(x)
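
A runnable sketch of this reward in Python (the coefficients and the x limit are illustrative assumptions, tune them for your setup):

import numpy as np

def swing_up_reward(theta, theta_dot, action, x, x_limit=2.4):
    # indicator for the cart leaving the track: 1 if out of bounds, else 0
    out_of_bound = 1.0 if abs(x) > x_limit else 0.0
    return (np.cos(theta)
            - 0.001 * theta_dot**2
            - 0.0001 * action**2
            - 100.0 * out_of_bound)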
Simon

I am working on the inverted pendulum too. I found the following reward function, which I am trying.

# angle_normalise wraps the angle into [-pi, pi)
costs = angle_normalise(th)**2 + 0.1*thdot**2 + 0.001*(action**2)
reward = -costs
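
For completeness, a sketch of the angle_normalise helper (the wrap-into-[-pi, pi) form used by the Gym pendulum environment linked in the comment below):

import numpy as np

def angle_normalise(x):
    # wrap an angle into the interval [-pi, pi)
    return ((x + np.pi) % (2 * np.pi)) - np.pi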

but I still have a problem in choosing the actions; maybe we can discuss.

sara
  • This is a common reward, used for instance in OpenAI Gym https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py. It will work. – Simon Aug 04 '18 at 08:54