I also suggest you start with a standard tabular Q-learning algorithm. If you really want to try approximate Q-learning, you can take any Atari game from OpenAI Gym and try to solve the control problem.
First of all, you need to design a neural network that maps a state to one Q-value per action.
import numpy as np
import gym
import tensorflow as tf
import keras
import keras.layers as L

# assuming a simple Gym environment such as CartPole; any env whose
# observation is a low-dimensional vector will do
env = gym.make("CartPole-v0")
state_dim = env.observation_space.shape
n_actions = env.action_space.n

tf.reset_default_graph()
sess = tf.InteractiveSession()
keras.backend.set_session(sess)

# a feed-forward network that maps a state to one Q-value per action
network = keras.models.Sequential()
network.add(L.InputLayer(state_dim))
network.add(L.Dense(200, activation='relu'))
network.add(L.Dense(200, activation='relu'))
network.add(L.Dense(n_actions))
It's a quite simple network, but it will work. Avoid saturating nonlinearities like sigmoid and tanh here: the agent's observations are not normalized, so those units may saturate right from initialization.
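As a quick sanity check, you can verify that the network maps a batch of states to one Q-value per action (the zero-filled dummy batch here is just an illustration):

dummy_states = np.zeros((32,) + state_dim, dtype='float32')
assert network.predict(dummy_states).shape == (32, n_actions)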
Then we sample actions with an epsilon-greedy policy:
def get_action(state, epsilon=0):
    # predict Q-values for this single state (note the added batch dimension)
    q_values = network.predict(state[None])[0]
    # with probability epsilon pick a random action, otherwise the greedy one
    if np.random.random() < epsilon:
        return int(np.random.choice(n_actions))
    return int(np.argmax(q_values))
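For example (with the env defined above), epsilon=0 always returns the greedy action, while larger values trade it for exploration:

s = env.reset()
print(get_action(s, epsilon=0))    # greedy action
print(get_action(s, epsilon=0.5))  # explores with probability 0.5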
Then we need to train the agent's Q-function to minimize the TD loss:

L = 1/N · Σ ( Q(s, a) − [ r(s, a) + γ · max_a' Q(s', a') ] )²

The term in square brackets is the TD target. When doing gradient descent, we won't propagate gradients through it, to make training more stable (that's what tf.stop_gradient is for below).
# placeholders for a batch of transitions (s, a, r, s', done)
states_ph = keras.backend.placeholder(dtype='float32', shape=(None,) + state_dim)
actions_ph = keras.backend.placeholder(dtype='int32', shape=[None])
rewards_ph = keras.backend.placeholder(dtype='float32', shape=[None])
next_states_ph = keras.backend.placeholder(dtype='float32', shape=(None,) + state_dim)
is_done_ph = keras.backend.placeholder(dtype='bool', shape=[None])
# get Q-values for all actions in the current states
predicted_qvalues = network(states_ph)
# select the Q-values for the chosen actions
predicted_qvalues_for_actions = tf.reduce_sum(predicted_qvalues * tf.one_hot(actions_ph, n_actions), axis=1)
gamma = 0.99  # discount factor
# compute q-values for all actions in next states
predicted_next_qvalues = network(next_states_ph)
# compute V*(next_states) using predicted next q-values
next_state_values = tf.reduce_max(predicted_next_qvalues, axis=1)
# compute "target q-values" for loss - it's what's inside square parentheses in the above formula.
target_qvalues_for_actions = rewards_ph + gamma*next_state_values
# for terminal transitions use the simplified formula Q(s,a) = r(s,a), since s' doesn't exist
target_qvalues_for_actions = tf.where(is_done_ph, rewards_ph, target_qvalues_for_actions)
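For instance, with gamma = 0.99, a transition with r(s,a) = 1 and max_a' Q(s',a') = 10 gets the target 1 + 0.99 · 10 = 10.9, while a terminal transition with the same reward gets a target of just 1.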
Finally, implement the mean squared error that you want to minimize:
loss = (predicted_qvalues_for_actions - tf.stop_gradient(target_qvalues_for_actions)) ** 2
loss = tf.reduce_mean(loss)
# a training op that plays the role of agent.update(state, action, reward, next_state) from the tabular agent
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
The remaining part is to generate sessions: play the environment with the approximate Q-learning agent and train it at the same time.
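Here is a minimal sketch of such a loop under the CartPole setup from above; the function name generate_session, t_max, the number of sessions per epoch, and the epsilon schedule are all arbitrary choices, not fixed parts of the algorithm:

sess.run(tf.global_variables_initializer())

def generate_session(t_max=1000, epsilon=0, train=False):
    """Play one full episode; optionally run a training step on every transition."""
    total_reward = 0
    s = env.reset()
    for t in range(t_max):
        a = get_action(s, epsilon=epsilon)
        next_s, r, done, _ = env.step(a)
        if train:
            sess.run(train_step, {
                states_ph: [s], actions_ph: [a], rewards_ph: [r],
                next_states_ph: [next_s], is_done_ph: [done],
            })
        total_reward += r
        s = next_s
        if done:
            break
    return total_reward

epsilon = 0.5
for i in range(1000):
    session_rewards = [generate_session(epsilon=epsilon, train=True) for _ in range(100)]
    epsilon *= 0.99  # slowly anneal exploration towards the greedy policy
    print("epoch %i\tmean reward = %.3f\tepsilon = %.3f" % (i, np.mean(session_rewards), epsilon))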