Background: I have a huge DataFrame with 40 million rows, and I need to run some functions on certain columns. Looping over the rows was taking too long, so I decided to use multiprocessing. CPU: 8 cores / 16 threads. RAM: 128 GB.
Question: How many chunks should I break the data into, and how many workers are appropriate for this dataset?
p.s. I noticed that when I set max_workers=15, all threads run at 100% utilization, but if I raise max_workers to 40, utilization drops to about 40%.
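For context, here is a simplified sketch of the kind of setup I mean, assuming the chunks are made with np.array_split and the workers come from concurrent.futures.ProcessPoolExecutor (the function name process_chunk and the column are just placeholders for my actual code):

```python
import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the real per-column functions I need to run
    chunk["result"] = chunk["some_column"] * 2
    return chunk

def run_parallel(df: pd.DataFrame, n_chunks: int, max_workers: int) -> pd.DataFrame:
    # split the 40M-row frame into roughly equal pieces
    chunks = np.array_split(df, n_chunks)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # each chunk is pickled, sent to a worker process, and processed independently
        results = list(pool.map(process_chunk, chunks))
    return pd.concat(results, ignore_index=True)

if __name__ == "__main__":
    df = pd.DataFrame({"some_column": np.arange(40_000_000)})
    out = run_parallel(df, n_chunks=16, max_workers=15)
```

The numbers n_chunks=16 and max_workers=15 are just what I have been experimenting with, not values I know to be right.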
Thank you!