I also suggest you start with a standard tabular Q-learning algorithm. If you really want to try approximate Q-learning, you can take any Atari game from OpenAI Gym and try to solve the control problem.
First of all, you need to design a neural network that maps a state to one Q-value per action.
import numpy as np
import gym
import tensorflow as tf
import keras
import keras.layers as L

# assuming a simple Gym environment such as CartPole; any env whose
# observation is a low-dimensional vector will do
env = gym.make("CartPole-v0")
state_dim = env.observation_space.shape
n_actions = env.action_space.n

tf.reset_default_graph()
sess = tf.InteractiveSession()
keras.backend.set_session(sess)

# a feed-forward network that maps a state to one Q-value per action
network = keras.models.Sequential()
network.add(L.InputLayer(state_dim))
network.add(L.Dense(200, activation='relu'))
network.add(L.Dense(200, activation='relu'))
network.add(L.Dense(n_actions))
It's a quite simple network, but it will work. Avoid saturating nonlinearities like sigmoid and tanh here: the agent's observations are not normalized, so those units may saturate right from initialization.
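As a quick sanity check, you can verify that the network maps a batch of states to one Q-value per action (the zero-filled dummy batch here is just an illustration):

dummy_states = np.zeros((32,) + state_dim, dtype='float32')
assert network.predict(dummy_states).shape == (32, n_actions)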
Then we sample actions with an epsilon-greedy policy:
def get_action(state, epsilon=0):
    # predict Q-values for this single state (note the added batch dimension)
    q_values = network.predict(state[None])[0]
    # with probability epsilon pick a random action, otherwise the greedy one
    if np.random.random() < epsilon:
        return int(np.random.choice(n_actions))
    return int(np.argmax(q_values))
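For example (with the env defined above), epsilon=0 always returns the greedy action, while larger values trade it for exploration:

s = env.reset()
print(get_action(s, epsilon=0))    # greedy action
print(get_action(s, epsilon=0.5))  # explores with probability 0.5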
Then we need to train the agent's Q-function to minimize the TD loss:

L = 1/N · Σ ( Q(s, a) − [ r(s, a) + γ · max_a' Q(s', a') ] )²

The term in square brackets is the TD target. When doing gradient descent, we won't propagate gradients through it, to make training more stable (that's what tf.stop_gradient is for below).
# placeholders for a batch of transitions (s, a, r, s', done)
states_ph = keras.backend.placeholder(dtype='float32', shape=(None,) + state_dim)
actions_ph = keras.backend.placeholder(dtype='int32', shape=[None])
rewards_ph = keras.backend.placeholder(dtype='float32', shape=[None])
next_states_ph = keras.backend.placeholder(dtype='float32', shape=(None,) + state_dim)
is_done_ph = keras.backend.placeholder(dtype='bool', shape=[None])
# get Q-values for all actions in the current states
predicted_qvalues = network(states_ph)
# select the Q-values for the chosen actions
predicted_qvalues_for_actions = tf.reduce_sum(predicted_qvalues * tf.one_hot(actions_ph, n_actions), axis=1)
gamma = 0.99  # discount factor
# compute q-values for all actions in next states
predicted_next_qvalues = network(next_states_ph)
# compute V*(next_states) using predicted next q-values
next_state_values = tf.reduce_max(predicted_next_qvalues, axis=1)
# compute "target q-values" for loss - it's what's inside square parentheses in the above formula.
target_qvalues_for_actions = rewards_ph + gamma*next_state_values
# for terminal transitions use the simplified formula Q(s,a) = r(s,a), since s' doesn't exist
target_qvalues_for_actions = tf.where(is_done_ph, rewards_ph, target_qvalues_for_actions)
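For instance, with gamma = 0.99, a transition with r(s,a) = 1 and max_a' Q(s',a') = 10 gets the target 1 + 0.99 · 10 = 10.9, while a terminal transition with the same reward gets a target of just 1.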
Finally, implement the mean squared error that you want to minimize:
loss = (predicted_qvalues_for_actions - tf.stop_gradient(target_qvalues_for_actions)) ** 2
loss = tf.reduce_mean(loss)
# a training op that plays the role of agent.update(state, action, reward, next_state) from the tabular agent
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
The remaining part is to generate sessions: play the environment with the approximate Q-learning agent and train it at the same time.
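Here is a minimal sketch of such a loop under the CartPole setup from above; the function name generate_session, t_max, the number of sessions per epoch, and the epsilon schedule are all arbitrary choices, not fixed parts of the algorithm:

sess.run(tf.global_variables_initializer())

def generate_session(t_max=1000, epsilon=0, train=False):
    """Play one full episode; optionally run a training step on every transition."""
    total_reward = 0
    s = env.reset()
    for t in range(t_max):
        a = get_action(s, epsilon=epsilon)
        next_s, r, done, _ = env.step(a)
        if train:
            sess.run(train_step, {
                states_ph: [s], actions_ph: [a], rewards_ph: [r],
                next_states_ph: [next_s], is_done_ph: [done],
            })
        total_reward += r
        s = next_s
        if done:
            break
    return total_reward

epsilon = 0.5
for i in range(1000):
    session_rewards = [generate_session(epsilon=epsilon, train=True) for _ in range(100)]
    epsilon *= 0.99  # slowly anneal exploration towards the greedy policy
    print("epoch %i\tmean reward = %.3f\tepsilon = %.3f" % (i, np.mean(session_rewards), epsilon))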