We are two French mechanical engineering students interested in reinforcement learning, trying to apply Q-learning to a rotary inverted pendulum for a project. We have watched David Silver's YouTube course and read chapters of Sutton & Barto; the basic theory was easy enough, but we have yet to see any positive results on our pendulum.
Here is a picture of the rotary inverted pendulum we built and a graph of our latest test, showing the average reward per episode (in green). A computer running Python code communicates with an Arduino, which in turn controls a stepper motor. A rotary encoder gives us the angle of the pendulum, from which we also calculate the angular velocity.
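For context, the angular velocity can be estimated with a simple finite difference between successive encoder readings, along these lines (a sketch; DT is a placeholder for the sampling period, not our actual value):

DT = 0.02  # placeholder sampling period in seconds

def angular_velocity(prev_angle, angle, dt=DT):
    # Backward finite difference; noisy in practice, so some
    # smoothing (e.g. a moving average) may be needed.
    return (angle - prev_angle) / dt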
As a first step, we have chosen to use Q-learning over a discrete two-dimensional state space (angular position and angular velocity). We have let our system run for many hours without any sign of improvement. We have tried varying the parameters of the algorithm, the possible actions, the number of states and their partitioning, etc. Our system also tends to heat up quite a bit, so we have separated the learning into episodes of about 200 steps followed by a brief period of rest. To increase speed and precision, we batch-update the Q values at the end of each episode.
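For reference, here is a simplified sketch of the kind of discretization we mean; the bin counts and ranges below are illustrative placeholders, not our exact values:

import math

N_ANGLE_BINS = 16   # illustrative, not our actual value
N_VEL_BINS = 8      # illustrative, not our actual value
MAX_VEL = 10.0      # illustrative velocity clip in rad/s

def discretize(angle, velocity):
    # Wrap the angle to [0, 2*pi) and clip the velocity,
    # then map both onto integer bin indices.
    angle = angle % (2 * math.pi)
    i = min(int(angle / (2 * math.pi) * N_ANGLE_BINS), N_ANGLE_BINS - 1)
    v = max(-MAX_VEL, min(velocity, MAX_VEL))
    j = min(int((v + MAX_VEL) / (2 * MAX_VEL) * N_VEL_BINS), N_VEL_BINS - 1)
    return (i, j)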
Here is our update function:
# Get Q values from database
Q_dict = agent.getAllQ()
E_dict = {}
# Initialize eligibility traces to 0 for all state-action pairs
for s, a in a_StateActionPairs:
    E_dict[(s, a)] = 0
# Watkins's Q(lambda) batch update: replay every step of the episode
for i_r in episode_record:
    state, action, new_state, new_action, greedy_action, R = i_r
    # Get Q for the current step; the target uses the greedy
    # action in the next state (off-policy)
    Q = Q_dict[(state, action)]
    target = R + GAMMA * Q_dict[(new_state, greedy_action)]
    # Accumulating trace for the visited state-action pair
    E_dict[(state, action)] += 1
    # Propagate the TD error to every state-action pair, weighted by its trace
    for s, a in a_StateActionPairs:
        Q_dict[(s, a)] += ALPHA * E_dict[(s, a)] * (target - Q)
        # Decay the traces; cut them to 0 if new_action was
        # chosen at random (epsilon-greedy)
        if greedy_action == new_action:
            E_dict[(s, a)] *= GAMMA * LAMBDA
        else:
            E_dict[(s, a)] = 0
# Update database
agent.setAllQ(Q_dict)
log.info('Qvalues updated')
Here is "main" part of the code: Github (https://github.com/Blabby/inverted-pendulum/blob/master/QAlgo.py)
Here are some of our hypotheses as to why our tests are unsuccessful:
- The system has not run for a long enough time
- Our exploration (ε-greedy) is not adapted to the problem (see the sketch after this list)
- The hyper-parameters are not optimized
- Our physical system is too unpredictable
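On the exploration point, one variant would be to anneal ε over episodes rather than keeping it fixed. A sketch, with placeholder constants:

EPS_START = 1.0    # placeholder: explore heavily at first
EPS_END = 0.05     # placeholder: residual exploration rate
EPS_DECAY = 0.995  # placeholder: per-episode decay factor

def epsilon(episode):
    # Anneal epsilon from EPS_START toward EPS_END as episodes accumulate.
    return max(EPS_END, EPS_START * EPS_DECAY ** episode)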
Does anyone have any experience applying reinforcement learning to physical systems? We have hit a roadblock and are looking for help and/or ideas.