
Background: I have a huge DataFrame with 40 million rows and need to run some functions on some of its columns. Plain loops were taking too long, so I decided to use multiprocessing. CPU: 8 cores / 16 threads. RAM: 128 GB.

Question: How many chunks should I split the data into, and how many workers are appropriate for this dataset?

P.S. I found that when I set max_workers = 15, all threads run at 100%. But if I change max_workers to 40, they drop to about 40%.

Thank you!

1 Answer


Parallel workloads fall into three broad categories: I/O-bound, CPU-bound, and mixed I/O-and-CPU-bound. If your task is CPU-bound, you can increase the number of workers (up to the number of CPU cores) to get better performance. But if it is I/O-bound, adding workers will have little effect.

You seem to be working on a mixed I/O-and-CPU-bound task. Increasing the worker count helps only until the workers start competing for the I/O resource (the hard disk), so on a local machine it is not a good idea to keep raising the number of workers.

For work at this scale, you can also use Hadoop on GCP or AWS.

dauren slambekov