
I am using a script to scrape news from many websites with newspaper3k. Instead of running it sequentially, I tried to utilize all of my cores by using joblib.Parallel.
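
A minimal sketch of this kind of setup (illustrative names and placeholder URLs, not the exact script):

```python
from joblib import Parallel, delayed
from newspaper import Article  # newspaper3k

urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()   # blocking HTTP request
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

# Fan the URLs out across all cores; joblib's default loky backend uses worker processes.
results = Parallel(n_jobs=-1)(delayed(scrape_one)(u) for u in urls)
```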

However, it still takes A LOT of time (50 websites take around 20 minutes). I profiled the script, and it turns out the majority of the time (51%) is spent waiting on locks from Parallel:

[Profiler output: roughly 51% of total runtime spent waiting on locks inside joblib's Parallel]

Is there any way I can improve that? I thought of using async, but it turns out joblib doesn't work too well with it.

  • Don't use joblib and just use Python built-ins like `concurrent.futures`? It comes with a `ThreadPoolExecutor` (see the sketch after these comments). – gold_cy Dec 29 '21 at 18:43
  • Thanks @gold_cy, I am looking into it and will see if it improves my results – Dr. Prof. Patrick Dec 29 '21 at 19:10
  • Well, it's much, much faster now! Thank you so much! If you want you can write it as an answer and I'll approve it. Again, thanks! – Dr. Prof. Patrick Dec 29 '21 at 19:19
  • I'm glad it worked for you, but all I provided was a comment/suggestion; you figured out the rest yourself =) – gold_cy Dec 29 '21 at 19:45
  • +1 for profiling ;o) Ad hoc, for improving runtimes there is one more dimension along which to scale up performance. The trick of masking latency with more, merely concurrent, processing requests hits a principal Python ceiling, the central GIL lock, so many-core architectures do not actually matter here. If one uses a distributed-computing trick (be it localhost-only or many hosts), the central Python interpreter can send tasks to as many GIL-independent Python processes as needed, so that your scraping runtime gets as low as the "longest" end-to-end latency plus a few milliseconds. – user3666197 Jan 06 '22 at 06:34
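
A minimal sketch of the `concurrent.futures` route suggested in the first comment, reusing an illustrative `scrape_one` helper (names and the `max_workers` value are placeholders, not part of the original thread):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs

# Threads suit this workload: most of the time is spent waiting on HTTP responses,
# during which the GIL is released, and there is no per-task process/lock overhead.
results = []
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {executor.submit(scrape_one, url): url for url in urls}
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:
            print(f"{futures[future]} failed: {exc}")
```

The `max_workers` value here is a guess; tune it to how many simultaneous requests the target sites tolerate.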

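For completeness, the last comment's point about GIL-independent Python processes can also be sketched with the standard library; this is only an assumed illustration and mainly pays off if parsing, rather than the network wait, becomes the bottleneck:

```python
from concurrent.futures import ProcessPoolExecutor
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Download and parse a single article with newspaper3k."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

if __name__ == "__main__":  # guard required for process-based executors on spawn platforms
    urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs
    # Each worker is a separate Python interpreter with its own GIL, so CPU-heavy
    # parsing can run in parallel across cores.
    with ProcessPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(scrape_one, urls))
```
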
0 Answers