I am testing different num_workers values for DataLoader in PyTorch, and 0 consistently gives the shortest running time.
I also tried https://github.com/developer0hye/Num-Workers-Search, which automatically searches for the best num_workers based on the dataset, batch_size, and some other parameters, and it also reports 0 as the ideal value.
The CPU is a server-class AMD EPYC with 128 cores (256 threads), and the machine runs Ubuntu 20.04.
Could the CPU's processing power explain why the ideal num_workers is 0? This seems counter-intuitive, especially with so many threads available.
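
For reference, here is a minimal sketch of this kind of timing comparison. It is only an illustration: the synthetic in-memory TensorDataset, the tensor shapes, the candidate worker counts, and the helper names time_epoch and compare_workers are all assumptions, not the actual dataset or code being tested.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_epoch(dataset, batch_size, num_workers):
    """Return the seconds needed to iterate over the dataset once."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # no training step: only data loading is timed
    return time.perf_counter() - start


def compare_workers(dataset, batch_size, candidates):
    """Time one pass over the dataset for each candidate num_workers value."""
    return {n: time_epoch(dataset, batch_size, n) for n in candidates}


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): 10,000 samples with 32 features each.
    data = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
    results = compare_workers(data, batch_size=64, candidates=(0, 2, 4, 8))
    for n, seconds in results.items():
        print(f"num_workers={n}: {seconds:.3f} s")
```

Note that the measured time includes worker process startup, because the workers are created when iteration over the loader begins.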