I'm working on a deep reinforcement learning project with TensorFlow, and I am struggling to train a DQN agent from the tf_agents module.
My project aims to simulate a fiscal society where there are three possible actions: pay taxes, pay more taxes (voluntarily), and evade. I want my DQN agent to learn to maximize its wealth by choosing the best action depending on the state of the environment. My DQN agent looks like this:
dqn_agent.DqnAgent(
    self.entorno_entrenamiento.time_step_spec(),
    self.entorno_entrenamiento.action_spec(),
    q_network=self.q_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0))
The Q-network is defined like this:
QNetwork(
    self.entorno_entrenamiento.observation_spec(),
    self.entorno_entrenamiento.action_spec(),
    fc_layer_params=(128, 64, 64, 32))
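In case it is relevant: the environment passed to the agent and the network is my custom environment wrapped as a TF environment, roughly like this (FiscalSocietyEnv is just a placeholder name for my environment class shown further below):

from tf_agents.environments import tf_py_environment

# Wrap the custom py_environment so the agent, driver and replay buffer can use it.
self.entorno_entrenamiento = tf_py_environment.TFPyEnvironment(FiscalSocietyEnv())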
And the main training loop:
observer = [buffer.add_batch]

# Collect initial experience with a random policy.
driver = dynamic_step_driver.DynamicStepDriver(
    self.train_env,
    random_policy,
    observers=observer,
    num_steps=50000)
driver.run()

dataset = buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=64,
    num_steps=2).prefetch(3)
iterator = iter(dataset)

reward = 0.0
losses = []
rewards = []
for i in range(n_iteraciones):
    # Train the agent on a batch sampled from the replay buffer.
    experience, _ = next(iterator)
    loss = self.agenteDQN.train(experience).loss
    step = self.agenteDQN.train_step_counter.numpy()
    reward += experience.reward[0].numpy()[0]

    # Evaluate the agent.
    if step % log_interval == 0:
        average_reward = reward / log_interval
        reward = 0.0
        print('step = {0}: loss = {1}, average reward = {2}'.format(step, loss, average_reward))
        losses.append(loss)
        rewards.append(average_reward)
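buffer and random_policy are not shown above; they are the usual TF-Agents objects, created roughly like this (a sketch; the buffer capacity is indicative, not necessarily the exact value I use):

from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.policies import random_tf_policy

# Uniform replay buffer fed by the driver through buffer.add_batch.
buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=self.agenteDQN.collect_data_spec,
    batch_size=self.train_env.batch_size,
    max_length=100000)

# Random policy used only for the initial data collection.
random_policy = random_tf_policy.RandomTFPolicy(
    self.train_env.time_step_spec(),
    self.train_env.action_spec())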
My custom environment implements the py_environment interface from TF-Agents:
def calculate_reward(self, action, state, new_state):
    fairness = state[0]
    wealth = state[2]
    old_wealth = new_state[2]
    diff = wealth - old_wealth
    if action == 0:    # pay min
        if fairness > 0.5:
            reward = diff + 1
        else:
            reward = diff
    elif action == 1:  # pay max
        reward = diff
    else:              # evade
        if fairness < 0.6:
            reward = diff + 1
        else:
            reward = diff
    return reward
def _step(self, action):
    if self._episode_ended:
        # The last action ended the episode. Ignore the current action and start
        # a new episode.
        return self.reset()

    self._n_steps += 1
    old_state = self._state
    self._agents.step(action)
    self._state = self._agents.observation()
    self._reward = self.calculate_reward(action, old_state, self._state)

    if (self._n_steps >= 400) or (self._state[2] > 0.65 and self._n_steps >= 400):
        self._episode_ended = True
        if self._n_steps >= 400 and self._state[2] < 0.66:
            return ts.termination(self._state, reward=0.0)
        else:
            return ts.termination(self._state, reward=10.0)
    else:
        return ts.transition(self._state, reward=self._reward, discount=1.0)
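As a concrete example of the reward logic: with action = 0, fairness = 0.7, state[2] = 0.40 and new_state[2] = 0.38, calculate_reward returns (0.40 - 0.38) + 1 = 1.02.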
The state is a list of 3 floats, each between 0 and 1 inclusive: [fairness, gini_coeficient, agent_normalized_wealth].
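The observation and action specs of the environment are essentially the following (a sketch matching the description above; the exact dtypes may differ):

from tf_agents.specs import array_spec
import numpy as np

# Observation: [fairness, gini_coeficient, agent_normalized_wealth], all in [0, 1].
self._observation_spec = array_spec.BoundedArraySpec(
    shape=(3,), dtype=np.float32, minimum=0.0, maximum=1.0, name='observation')

# Actions: 0 = pay min, 1 = pay max, 2 = evade.
self._action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')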
When I train the agent for 500,000 iterations, the loss converges but the average reward does not increase. I have tried tuning the hyperparameters, but I still get the same problem. Any ideas on what I should try?