Is it normal for the learn throughput to decrease and the learn time to increase as a Dueling DDQN agent is trained?
A 7x increase in learn time after a few hours of training seems significant; is this something you would expect to see? My system is only using about 20% of its 8-core CPU and 25 GB of its 64 GB of memory.
A ray.rllib.agents.dqn model is currently being trained on the CPU. The config values are all defaults except:
config["timesteps_per_iteration"] = 5000
config["noisy"] = True
config["compress_observations"] = True
config["num_workers"] = 4
config["num_envs_per_worker"] = 8
config["eager"] = True
After further training, the learn throughput has dropped even further, to about 20. CPU usage remains around 20%, and memory usage is at 50 GB.
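
For context, the learn throughput and learn time figures are taken from the per-iteration result dict returned by trainer.train(). A sketch of the logging loop, assuming the trainer from the snippet above and the timer keys RLlib reports under "timers":

for i in range(200):
    result = trainer.train()
    timers = result.get("timers", {})
    # learn_throughput is in trained timesteps per second; learn_time_ms is
    # the time spent in the learner step for this iteration.
    print(
        "iter", i,
        "learn_throughput:", timers.get("learn_throughput"),
        "learn_time_ms:", timers.get("learn_time_ms"),
    )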
Using Ray 0.8.5, TensorFlow 2.2.0, and Python 3.8.3 on Ubuntu 18.04 inside WSL2.