I'm currently running a Python script against multiple web servers. The general task is to find broken (external) links within a CMS. The script runs pretty well so far, but since I test around 50 internal projects, each with several hundred sub-pages, this ends up being several thousand external links I have to check.
For that reason I added multi-threading, which improves performance as I hoped. But here comes the problem: if a page to check contains a list of links that all point to the same server (a bundle of known issues or tasks to do), it slows down that destination system. I don't want to slow down my own servers, and certainly not servers that aren't mine.
Currently I run up to 20 threads and then wait 0.5 s until a "thread slot" is free again. To check whether a URL is broken, I call urlopen(request) from urllib2 and log every time it throws an HTTPError. Back to the list of multiple URLs pointing to the same server: because of the multi-threading, my script will "flood" that web server with up to 20 simultaneous requests.
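To make the current approach concrete, here is a stripped-down sketch of roughly what my script does. The function names, the URL list and the timeout are placeholders, not the real script, and the semaphore here just plays the role of my 0.5 s "wait for a free thread slot" loop:

```python
# Simplified sketch of the current link checker (placeholder names).
import threading
import urllib2

MAX_THREADS = 20
pool_slots = threading.Semaphore(MAX_THREADS)   # limits concurrent requests

def log_broken(url, reason):
    print "BROKEN %s (%s)" % (url, reason)

def check_url(url):
    try:
        urllib2.urlopen(urllib2.Request(url), timeout=10)
    except urllib2.HTTPError as e:
        log_broken(url, e.code)                 # 4xx/5xx responses
    except urllib2.URLError as e:
        log_broken(url, e.reason)               # DNS errors, timeouts, ...
    finally:
        pool_slots.release()                    # free the slot for the next URL

urls_to_check = ["http://example.com/a", "http://example.com/b"]  # gathered from the CMS

threads = []
for url in urls_to_check:
    pool_slots.acquire()                        # blocks while 20 checks are running
    t = threading.Thread(target=check_url, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
```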
Just so you have an idea of the dimensions this script runs at and how many URLs have to be checked: with only 20 threads, the current script already takes about 45 min for just 4 projects. And this is only checking .. The next step will be to check broken URLs for . While the current script runs, server monitoring shows peaks of 1000 ms response time.
Does anyone have an idea how to improve this script in general? Or is there a much better way to check this large number of URLs? Maybe a counter that pauses the thread if there are more than 10 requests to a single destination?
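To illustrate that idea, this is roughly what I have in mind: one small semaphore per hostname on top of the global thread limit, so no single server ever sees more than a few simultaneous requests. The numbers (20 overall, 2 per host) and names are just examples, not tested values:

```python
# Sketch of per-destination throttling (example limits, placeholder names).
import threading
import urllib2
import urlparse

global_slots = threading.Semaphore(20)   # overall limit, as today
per_host_slots = {}                      # hostname -> Semaphore
per_host_lock = threading.Lock()
MAX_PER_HOST = 2

def host_slot(url):
    host = urlparse.urlparse(url).netloc
    with per_host_lock:                  # dict is touched from many threads
        if host not in per_host_slots:
            per_host_slots[host] = threading.Semaphore(MAX_PER_HOST)
        return per_host_slots[host]

def check_url(url):
    slot = host_slot(url)
    with global_slots:
        with slot:                       # at most MAX_PER_HOST hit the same server
            try:
                urllib2.urlopen(urllib2.Request(url), timeout=10)
            except urllib2.HTTPError as e:
                print "BROKEN %s (%s)" % (url, e.code)
            except urllib2.URLError as e:
                print "BROKEN %s (%s)" % (url, e.reason)
```

That would at least stop a page full of links to one server from occupying all 20 threads at once; the worst case for a single host would then be MAX_PER_HOST parallel requests instead of 20. Is something like this reasonable, or is there a better pattern for it?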
Thanks for any suggestions.