Short intro: I'm the lead developer on a webcrawler project. It's a fairly mature project, and on a daily basis we execute anywhere from about 3,000 to 70,000 individual crawlers. We have a mid-size server farm, with each server running between 100 and 400 crawlers at a time.
The issue: We are seeing intermittent failures when accessing HTTPS/TLS sites, but only on our Windows Server 2008 machines; our Windows Server 2003 installations are unaffected. Crawlers will be running normally, and then suddenly none of them can complete web requests against HTTPS sites any more. They simply wait for their allotted timeout period and then fail. They all fail in unison, and new crawlers started while the issue is present fail as well.
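To illustrate the failure mode, here is a minimal probe sketch in Python (our crawlers are not written in Python; the URL and timeout value are placeholders). In the bad state, the HTTPS request times out while, as far as we can tell, plain HTTP keeps working:

```python
import socket
import urllib.error
import urllib.request

PROBE_TIMEOUT = 15  # seconds; roughly the per-request timeout our crawlers use

def probe(url: str) -> bool:
    """Return True if url answers within PROBE_TIMEOUT, False if it times out or fails."""
    try:
        with urllib.request.urlopen(url, timeout=PROBE_TIMEOUT) as resp:
            resp.read(1)  # force at least one byte over the wire
        return True
    except (TimeoutError, socket.timeout, urllib.error.URLError):
        return False

if __name__ == "__main__":
    http_ok = probe("http://example.com/")    # placeholder URL
    https_ok = probe("https://example.com/")  # placeholder URL
    print(f"HTTP reachable: {http_ok}  HTTPS reachable: {https_ok}")
    # In the bad state we would expect: HTTP reachable: True  HTTPS reachable: False
```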
The solution: Opening an Internet Explorer instance on the affected server and browsing to any HTTPS/TLS site clears the issue up. Immediately, all the crawlers stop timing out and simply work as they are supposed to. Sometimes more than a week will pass without a server experiencing the problem.
The question: Does anyone have a clue what is going on here? Our current workaround is to launch Internet Explorer daily on every Windows 2008 server and point it at an HTTPS site, in the hope of heading this off before it becomes too much of an issue. That is very unsatisfying, and it won't scale properly.
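For reference, the daily workaround amounts to little more than the following sketch (Python here only for illustration; the iexplore.exe path and URL are placeholders, and a one-line batch file scheduled via Task Scheduler would do the same):

```python
# Sketch of the scripted workaround: launch IE at an HTTPS site so that
# whatever per-machine TLS state IE repairs gets repaired, then close it.
import subprocess
import time

IEXPLORE = r"C:\Program Files\Internet Explorer\iexplore.exe"  # placeholder path
WARMUP_URL = "https://example.com/"  # any HTTPS site seems to do

def nudge_https_via_ie() -> None:
    proc = subprocess.Popen([IEXPLORE, WARMUP_URL])
    time.sleep(30)  # give IE time to complete the TLS handshake and page load
    proc.kill()     # note: may leave child frame processes on newer IE versions

if __name__ == "__main__":
    nudge_https_via_ie()
```

That keeps the servers alive, but it treats the symptom rather than the cause, which is exactly what I'd like to get away from.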