
I have a crawler that uses cURL to scrape data from many arrays of URLs, but this is rather slow. I'd like to speed it up by forking several child processes that fetch URLs concurrently.

The question is: how do I determine the optimal number of concurrent workers? I have a decent dedicated server, but I'm not sure how to calculate and allocate its resources so my scripts finish in the least amount of time.
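
Since nearly all of the wall time in a crawler is spent waiting on the network rather than the CPU, the usual win comes from issuing many requests concurrently. A minimal sketch in Python (the original is presumably PHP with cURL, so the `fetch` body, the example URLs, and the 100 ms simulated latency are all stand-ins for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real HTTP request; a real crawler would use
    urllib/requests here (or curl_multi in PHP). We simulate the
    network wait with a sleep."""
    time.sleep(0.1)  # pretend the remote server took 100 ms to respond
    return f"<html>body of {url}</html>"

urls = [f"http://example.com/page/{i}" for i in range(20)]  # hypothetical URLs

# Serial baseline: total time is roughly 20 * 0.1 s = 2 s
start = time.perf_counter()
serial = [fetch(u) for u in urls]
serial_time = time.perf_counter() - start

# 10 concurrent workers: 20 tasks run in 2 waves, roughly 0.2 s total
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(fetch, urls))
parallel_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, 10 workers: {parallel_time:.2f}s")
```

Because the workers spend almost all their time blocked on I/O, far more workers than CPU cores can help; the "one thread per core" rule of thumb applies mainly to the CPU-bound parsing/processing stage.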

Orun
  • I have *usually* found 1 thread per core is optimal, but you should test on your system –  Mar 04 '15 at 21:08
  • If 99.9% of your script's execution time is waiting for I/O (i.e. data from the requested URLs), server resources are irrelevant. – lafor Mar 04 '15 at 21:20
  • It largely depends on your data processing. I would start with 1 thread per core, as @Dagon said. If you still have resources, keep adding threads until your load average gets close to the number of cores you have, or you start running out of memory/IO. – SeriousDron Mar 04 '15 at 21:54
  • But wouldn't using multiple processes enable me to make multiple requests in parallel instead of waiting for responses? – Orun Mar 04 '15 at 22:52
  • The only way to find out is to test it yourself. –  Mar 04 '15 at 23:02
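
As the comments say, the only reliable answer is to measure on your own machine. A hypothetical benchmark harness (Python; the sleep-based `fetch` and the URL list are placeholders for real requests) that sweeps worker counts and reports the fastest:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    time.sleep(0.05)  # placeholder for real network latency
    return url

urls = [f"http://example.com/{i}" for i in range(40)]  # hypothetical URLs

def run(workers):
    """Time the whole batch with a pool of the given size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, urls))
    return time.perf_counter() - start

timings = {w: run(w) for w in (1, 2, 4, 8, 16)}
best = min(timings, key=timings.get)
print(f"best worker count in sweep: {best}")
```

On a real crawler you would run a sweep like this while watching load average and memory, and stop scaling up once throughput plateaus or the machine starts to saturate.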

0 Answers