
I am new to Python, or any programming language for that matter. For months now I have been working on stabilising the inverted pendulum. I have gotten everything working, but I am struggling to get the right reward function. So far, after research and trial and error, the best I could come up with is

R=(x_dot**2)+0.001*(x**2)+0.1*(theta**2)

But I don't reach stability, meaning theta=0 held for long enough.

Does anyone have an idea of the logic behind the ideal reward function?
Thank you.

Stevy KUIMI
    Is this the pendulum or the cart-pole? I see `x` and I assume it is the x-coordinate of the cart, but your title says just pendulum. Also, is this a penalty cost? Because you usually want to penalize for high velocity / acceleration to have smooth trajectories. – Simon Jul 25 '18 at 14:02
  • Yes Simon, this is the cart-pole problem and, yes, I want to establish a penalty cost. – Stevy KUIMI Aug 03 '18 at 18:14

2 Answers


For just the balancing problem (not the swing-up), even a binary reward is enough. Something like

  • Always 0, then -1 when the pole falls. Or,
  • Always 1, then 0 when the pole falls.

Which one to use depends on the algorithm, the discount factor, and the episode horizon. Either way, the task is easy and both will do the job.
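
For example, a minimal sketch of the second variant (the angle threshold is an illustrative assumption, roughly the usual cart-pole termination condition):

import math

def binary_reward(theta, theta_threshold=math.radians(12)):
    # 1 for every step the pole is still within the allowed angle,
    # 0 once it has fallen (the episode usually terminates there)
    return 1.0 if abs(theta) < theta_threshold else 0.0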

For the swing-up task (harder than just balancing, as the pole starts upside down and you need to swing it up by moving the cart), it is better to have a reward that depends on the state. Usually the simple cos(theta) is fine. You can also add a penalty on the angular velocity and on the action, in order to prefer slow-changing, smooth trajectories, and a penalty if the cart goes out of the bounds of the x coordinate.
A reward including all these terms would look like this

reward = cos(theta) - 0.001*theta_d**2 - 0.0001*action**2 - 100*out_of_bound(x)
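
A runnable sketch of this reward in Python (the coefficients and the x limit are illustrative assumptions, tune them for your setup):

import numpy as np

def swing_up_reward(theta, theta_dot, action, x, x_limit=2.4):
    # indicator for the cart leaving the track: 1 if out of bounds, else 0
    out_of_bound = 1.0 if abs(x) > x_limit else 0.0
    return (np.cos(theta)
            - 0.001 * theta_dot**2
            - 0.0001 * action**2
            - 100.0 * out_of_bound)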
Simon

I am working on the inverted pendulum too. I found the following reward function, which I am trying.

# angle_normalise wraps the angle into [-pi, pi)
costs = angle_normalise(th)**2 + 0.1*thdot**2 + 0.001*(action**2)
reward = -costs
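
For completeness, a sketch of the angle_normalise helper (the wrap-into-[-pi, pi) form used by the Gym pendulum environment linked in the comment below):

import numpy as np

def angle_normalise(x):
    # wrap an angle into the interval [-pi, pi)
    return ((x + np.pi) % (2 * np.pi)) - np.pi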

but I still have a problem in choosing the actions; maybe we can discuss.

sara
  • This is a common reward, used for instance in OpenAI Gym https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py. It will work. – Simon Aug 04 '18 at 08:54