I have set up a TF-Agents DQN agent with a regular feed-forward network as the Q-network to learn trajectories, and that works fine. However, I'd now like to try a QRnnNetwork and train/learn from sequences of events, but I can't get it to work.
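For completeness, these are roughly the imports used throughout (TF-Agents plus TensorFlow/NumPy):
import numpy as np
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics
from tf_agents.networks.q_rnn_network import QRnnNetwork
from tf_agents.policies import random_tf_policy
from tf_agents.policies.q_policy import QPolicy
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer
from tf_agents.specs import array_spec
from tf_agents.utils import common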
The action and observation specs in my custom environment look as follows.
self._action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=num_actions - 1, name='action'
)
self._observation_spec = array_spec.BoundedArraySpec(
    shape=(num_features,), dtype=np.int32, name='observation'
)
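The train_env used below is this custom environment wrapped as a TF environment. A minimal sketch of that wrapping (MyCustomEnv is just a placeholder name for my env class):
from tf_agents.environments import tf_py_environment

# Wrap the Python env so the agent and drivers see batched tensors (batch_size 1).
train_env = tf_py_environment.TFPyEnvironment(MyCustomEnv())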
Next I set up the Q-network:
lstm_neurons = 50
train_sequence_length = 5
input_fc_layer_params = (40,)
output_fc_layer_params = (40,)

rnn_network = QRnnNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    lstm_size=(lstm_neurons,),
    input_fc_layer_params=input_fc_layer_params,
    output_fc_layer_params=output_fc_layer_params,
)
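The optimizer, epsilon schedule and step counter referenced in the agent below are ordinary TF objects; sketched here with illustrative values (the exact numbers shouldn't matter for the error):
train_step_counter = tf.Variable(0)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # illustrative
target_update_period = 200                                # illustrative
discount = 0.99                                           # illustrative
# Decaying epsilon for the collect policy; the exact schedule is illustrative.
epsilon_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0, decay_steps=20000, end_learning_rate=0.01)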
agent = dqn_agent.DqnAgent(
    time_step_spec=train_env.time_step_spec(),
    action_spec=train_env.action_spec(),
    q_network=rnn_network,
    optimizer=optimizer,
    target_update_period=target_update_period,
    td_errors_loss_fn=tf.keras.losses.Huber(reduction="none"),
    gamma=discount,
    epsilon_greedy=lambda: epsilon_fn(train_step_counter),
    train_step_counter=train_step_counter,
)
Next, to generate and consume training data, my understanding is that I should pass train_sequence_length + 1 as num_steps when sampling from the replay buffer; this was previously 2 and is now larger because of the longer sequences.
# Replay buffer and driver for training
replay_buffer = TFUniformReplayBuffer(
    agent.collect_data_spec,
    batch_size=replay_buffer_batch_size,
    max_length=replay_buffer_max_size
)
replay_buffer_observer = replay_buffer.add_batch
train_metrics = [tf_metrics.AverageReturnMetric()]
# Create q-policy to plot the learned q-values
qpolicy = QPolicy(train_env.time_step_spec(), train_env.action_spec(), rnn_network)
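For context, this is roughly how the Q-values can be read off the RNN network for plotting; a sketch only, not the exact plotting code (note the RNN state has to be threaded through):
# Sketch: query Q-values for the current observation, threading the RNN state.
ts = train_env.reset()
network_state = rnn_network.get_initial_state(batch_size=train_env.batch_size)
q_values, network_state = rnn_network(
    ts.observation, step_type=ts.step_type, network_state=network_state)
print(q_values.numpy())  # shape: (batch_size, num_actions)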
collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer_observer] + train_metrics,
    num_steps=collect_steps)
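Here replay_buffer_batch_size, replay_buffer_max_size and collect_steps are plain ints; the exact values shouldn't matter for the error, e.g.:
replay_buffer_batch_size = train_env.batch_size  # must match the env batch size (1 here)
replay_buffer_max_size = 20000                   # illustrative
collect_steps = 4                                # illustrative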
print('Initial data generation and setting up the dataset for training')
initial_collect_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())
init_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    initial_collect_policy,
    observers=[replay_buffer.add_batch, ShowProgress(replay_buffer_max_size)],
    num_steps=replay_buffer_max_size)
final_time_step, final_policy_state = init_driver.run()
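ShowProgress is just a small observer that prints how many trajectories have been collected; roughly:
class ShowProgress:
    """Observer that counts trajectories added to the replay buffer."""
    def __init__(self, total):
        self.counter = 0
        self.total = total
    def __call__(self, trajectory):
        if not trajectory.is_boundary():
            self.counter += 1
        if self.counter % 100 == 0:
            print("\r{}/{}".format(self.counter, self.total), end="")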
dataset = replay_buffer.as_dataset(
    sample_batch_size=16,
    num_steps=train_sequence_length + 1,  # previously 2
    num_parallel_calls=4).prefetch(4)
collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)
Once I start training using the commands below, I receive the error "ValueError: Dimensions must be equal, but are 5 and 16 for '{{node gradient_tape/loss/mul_4/Mul}} = Mul[T=DT_FLOAT](loss/Cast, gradient_tape/loss/Tile_1)' with input shapes: [16,5], [16,16]." Here 16 is the sample_batch_size and 5 is the sequence length I'm trying to pass. Please advise what I'm missing here.
time_step = None
policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)
iterator = iter(dataset)

# Collect a few steps with the collect policy, then train on one sampled batch.
time_step, policy_state = collect_driver.run(time_step, policy_state)
trajectories, buffer_info = next(iterator)
train_loss = agent.train(trajectories)
The observations now have shape (16, 6, 9), which seems correct: (sample_batch_size, train_sequence_length + 1, num_features). The same code runs fine with num_steps=2.
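For reference, the shapes can be checked by peeking at one sample from the dataset:
# Peek at one sample to verify the shapes coming out of the dataset.
sample_traj, _ = next(iter(dataset))
print(sample_traj.observation.shape)  # (16, 6, 9) -> (batch, seq + 1, features)
print(sample_traj.action.shape)       # (16, 6)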