
I am using a script to scrape news from many websites with newspaper3k. Instead of running it sequentially, I tried to utilize all of my cores by using joblib.Parallel.
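
A minimal sketch of this kind of setup (illustrative names and placeholder URLs, not the exact script):

```python
from joblib import Parallel, delayed
from newspaper import Article  # newspaper3k

urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()   # blocking HTTP request
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

# Fan the URLs out across all cores; joblib's default loky backend uses worker processes.
results = Parallel(n_jobs=-1)(delayed(scrape_one)(u) for u in urls)
```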

However, it still takes A LOT of time (50 websites take around 20 minutes). I profiled the script, and it turns out the majority of the time (51%) is spent waiting on locks from Parallel:

[Profiler output: roughly 51% of total runtime spent waiting on locks inside joblib's Parallel]

Is there any way I can improve that? I thought of using async, but it turns out joblib doesn't work too well with it.

  • Don't use joblib and just use Python built-ins like `concurrent.futures`? It comes with a `ThreadPoolExecutor` (see the sketch after these comments). – gold_cy Dec 29 '21 at 18:43
  • Thanks @gold_cy, I am looking into it and will see if it improves my results – Dr. Prof. Patrick Dec 29 '21 at 19:10
  • Well, it's much, much faster now! Thank you so much! If you want you can write it as an answer and I'll approve it. Again, thanks! – Dr. Prof. Patrick Dec 29 '21 at 19:19
  • I'm glad it worked for you, but all I provided was a comment/suggestion; you figured out the rest yourself =) – gold_cy Dec 29 '21 at 19:45
  • +1 for profiling ;o) Ad hoc, for improving runtimes there is one more dimension along which to scale up performance. The trick of masking latency with more, merely concurrent, processing requests hits a principal Python ceiling, the central GIL lock, so many-core architectures do not actually matter here. If one uses a distributed-computing trick (be it localhost-only or many hosts), the central Python interpreter can send tasks to as many GIL-independent Python processes as needed, so that your scraping runtime gets as low as the "longest" end-to-end latency plus a few milliseconds. – user3666197 Jan 06 '22 at 06:34
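
A minimal sketch of the `concurrent.futures` route suggested in the first comment, reusing an illustrative `scrape_one` helper (names and the `max_workers` value are placeholders, not part of the original thread):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs

# Threads suit this workload: most of the time is spent waiting on HTTP responses,
# during which the GIL is released, and there is no per-task process/lock overhead.
results = []
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(scrape_one, url): url for url in urls}
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:
            print(f"{futures[future]} failed: {exc}")
```

The `max_workers` value here is a guess; tune it to how many simultaneous requests the target sites tolerate.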

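For completeness, the last comment's point about GIL-independent Python processes can also be sketched with the standard library; this is only an assumed illustration and mainly pays off if parsing, rather than the network wait, becomes the bottleneck:

```python
from concurrent.futures import ProcessPoolExecutor
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

if __name__ == "__main__":  # guard required for process-based executors on spawn platforms
    urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs
    # Each worker is a separate Python interpreter with its own GIL, so CPU-heavy
    # parsing can run in parallel across cores.
    with ProcessPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(scrape_one, urls))
```
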
0 Answers