I'm working on a deep reinforcement learning project with TensorFlow, and I'm struggling with training a DQN agent from the tf_agents module.

My project simulates a fiscal society where there are three possible actions: pay taxes, pay more taxes (voluntarily), and evade. I want my DQN agent to learn to maximize its wealth by choosing the best action depending on the state of the environment. My DQN agent looks like this:

dqn_agent.DqnAgent(
    self.entorno_entrenamiento.time_step_spec(),
    self.entorno_entrenamiento.action_spec(),
    q_network=self.q_network,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4),
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=tf.Variable(0))

The Q-network looks like this:

QNetwork(
    self.entorno_entrenamiento.observation_spec(),
    self.entorno_entrenamiento.action_spec(),
    fc_layer_params=(128,64,64,32))

And the main training loop:

    observer = [buffer.add_batch]
    driver = dynamic_step_driver.DynamicStepDriver(
        self.train_env,
        random_policy,
        observers=observer,
        num_steps=50000).run()

    dataset = buffer.as_dataset(
        num_parallel_calls=3,
        sample_batch_size=64,
        num_steps=2).prefetch(3)

    iterator = iter(dataset)
    reward = 0.0
    for i in range(n_iteraciones):
        # Train the agent on a sampled batch of experience.
        experience, _ = next(iterator)
        loss = self.agenteDQN.train(experience).loss
        step = self.agenteDQN.train_step_counter.numpy()
        reward += experience.reward[0].numpy()[0]
        # Log the loss and the average reward every log_interval steps.
        if step % log_interval == 0:
            average_reward = reward / log_interval
            reward = 0.0
            print('step = {0}: loss = {1}, average reward = {2}'.format(step, loss, average_reward))
            losses.append(loss)
            rewards.append(average_reward)

My custom environment implements the py_environment.PyEnvironment interface from tf_agents:

def calculate_reward(self, action, state, new_state):
    fairness = state[0]
    wealth = state[2]
    old_wealth = new_state[2]
    diff = wealth - old_wealth

    if action == 0: # pay min
        if fairness > 0.5:
            reward = diff + 1
        else:
            reward = diff
    elif action == 1: # pay max
        reward = diff
    else: # evade
        if fairness < 0.6:
            reward = diff + 1
        else:
            reward = diff
    return reward

def _step(self, action):
    if self._episode_ended:
        # The last action ended the episode. Ignore the current action and start
        # a new episode.
        return self.reset()

    self._n_steps += 1

    old_state = self._state
    self._agents.step(action)
    self._state = self._agents.observation()
    self._reward = self.calculate_reward(action, old_state, self._state)

    if (self._n_steps >= 400) or (self._state[2] > 0.65 and self._n_steps >= 400):
        self._episode_ended = True
        if self._n_steps >= 400 and self._state[2] < 0.66:
            return ts.termination(self._state, reward=0.0)
        else:
            return ts.termination(self._state, reward=10.0)
    else:
        return ts.transition(self._state, reward=self._reward, discount=1.0)

The state is a list of 3 floats between 0 and 1, both inclusive: [fairness, gini_coefficient, agent_normalized_wealth].
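
For reference, the specs in my environment look roughly like this (a simplified sketch with English names; the exact bounds and names in my code may differ slightly):

from tf_agents.specs import array_spec
import numpy as np

# Three discrete actions: 0 = pay min, 1 = pay max, 2 = evade.
action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')

# Observation: [fairness, gini_coefficient, agent_normalized_wealth], all in [0, 1].
observation_spec = array_spec.BoundedArraySpec(
    shape=(3,), dtype=np.float32, minimum=0.0, maximum=1.0, name='observation')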

When I train the agent for 500,000 iterations, the loss converges but the average reward does not increase. I have tried tuning the hyperparameters but I still get the same problem. Any ideas about what I should try?

Willy

1 Answer

The line:

reward += experience.reward[0].numpy()[0]

is not clear to me. Where can we see how experience is defined? Does this line accumulate the reward correctly?
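
For comparison, the usual way to measure progress in TF-Agents is to periodically run the agent's greedy policy on a separate evaluation environment, instead of summing rewards sampled from the replay buffer. A rough sketch along the lines of the official DQN tutorial (eval_env and agent are placeholders for your own objects):

def compute_avg_return(environment, policy, num_episodes=10):
    # Roll out the policy for a few full episodes and average the returns.
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# For example, every log_interval training steps:
# avg_return = compute_avg_return(eval_env, agent.policy, num_episodes=10)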

Please share the entire code.

Also, looking at your reward function, I can see:

def calculate_reward(self, action, state, new_state):
    fairness = state[0]
    wealth = state[2]
    old_wealth = new_state[2]

I assume that wealth should be the current step's wealth and old_wealth the previous step's wealth, corresponding to new_state and state respectively, but in your case it is the other way around. Try printing the diff value and check whether it changes correctly; I'm assuming it is negative.
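
A quick way to see it with concrete numbers (the values here are made up, just to show the sign):

# state is the PREVIOUS observation, new_state is the CURRENT one,
# matching the call calculate_reward(action, old_state, self._state)
state = [0.7, 0.4, 0.30]      # previous wealth = 0.30
new_state = [0.7, 0.4, 0.35]  # current wealth grew to 0.35

wealth = state[2]          # 0.30, but named as if it were the current wealth
old_wealth = new_state[2]  # 0.35, but named as if it were the old wealth
diff = wealth - old_wealth
print(diff)  # -0.05: negative even though the agent actually gained wealth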

Another thing I see here that might be problematic is that you are using the previous state in the calculation of the reward, which might interfere with the Markov property that the environment needs to satisfy. I advise adding the previous state to the current observation by concatenating them together. See here for a better explanation
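
For example, something along these lines (just a sketch of the idea, not your exact environment; the names here are my own):

import numpy as np
from tf_agents.specs import array_spec

# The observation now carries both the previous and the current raw state,
# so a reward that depends on the change in wealth no longer uses
# information the agent cannot see.
observation_spec = array_spec.BoundedArraySpec(
    shape=(6,), dtype=np.float32, minimum=0.0, maximum=1.0, name='observation')

def build_observation(previous_state, current_state):
    # Concatenate the previous [fairness, gini, wealth] with the current one.
    return np.concatenate([previous_state, current_state]).astype(np.float32)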