I have a batch of URLs that I want to fetch. The list contains more than 50,000 URLs with different domain names, but all domains resolve to the same load-balanced server IP.

For each URL I want to log its result code, its fetch duration, the hash of the content, and its redirect headers.

The current method gets around 10 fetches per second with response times of around half a second.

How can I make the following execute faster?

I currently have the following code construction:

Parallel.ForEach(domainnames, ProcessItem);

The ProcessItem is based on the following:

static void Fetch2(Uri url)
{
    HttpWebResponse response;
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        response = (HttpWebResponse)request.GetResponse();
    }
    catch (WebException ex)
    {
        response = ex.Response as HttpWebResponse;
    }

    if (response == null) return;

    using (response)
    {
        // Process response.....
    }
}
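The `// Process response.....` step is elided above. As a rough illustration of how the logged values (result code, duration, content hash, redirect header) could be derived, here is a minimal sketch. The names `ResponseLog`, `HashContent` and `Describe` are hypothetical, SHA-256 is an assumed choice of hash, and `elapsed` is assumed to come from a `Stopwatch` started before `GetResponse()`:

```csharp
using System;
using System.IO;
using System.Net;
using System.Security.Cryptography;

static class ResponseLog
{
    // Hex-encoded SHA-256 of everything readable from the stream.
    // (Assumption: any stable hash would do; SHA-256 is just an example.)
    public static string HashContent(Stream content)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Builds one log line: status code, duration, body hash, Location header.
    public static string Describe(HttpWebResponse response, TimeSpan elapsed)
    {
        using (var body = response.GetResponseStream())
        {
            return string.Format("{0} {1}ms {2} {3}",
                (int)response.StatusCode,
                (long)elapsed.TotalMilliseconds,
                HashContent(body),
                response.Headers[HttpResponseHeader.Location]); // null when there is no redirect
        }
    }
}
```

Because `AllowAutoRedirect` is `false`, a 3xx response is returned as-is, so the `Location` header is available on the response itself.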

I have the following configuration applied:

<system.net>
    <connectionManagement>
        <add address="*" maxconnection="100" />
    </connectionManagement>
</system.net>

I tried the following:

  • Limiting the Parallel.ForEach by specifying new ParallelOptions { MaxDegreeOfParallelism = 25 }, as I thought I might be issuing too many web requests at once, but even lowering it further does not improve performance.
  • Applying async with Task.WaitAll(Task[]), but this resulted in lots of errors: all tasks get created very quickly, but almost all of them fail with connection errors.
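One possible explanation for the errors in the second attempt is that every request was started at once. A bounded variant of the async approach could look like the sketch below. It is not the code from the attempt above: `ThrottledFetch`, `ForEachAsync` and `FetchAsync` are hypothetical names, and it assumes .NET 4.5 or later for `GetResponseAsync` and `SemaphoreSlim.WaitAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledFetch
{
    // Runs `work` for every item with at most `maxConcurrency` operations in flight.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();
            foreach (var item in items)
            {
                await gate.WaitAsync();            // wait for a free slot before starting
                tasks.Add(RunAsync(item, work, gate));
            }
            await Task.WhenAll(tasks);             // completes only after every task finished
        }
    }

    private static async Task RunAsync<T>(T item, Func<T, Task> work, SemaphoreSlim gate)
    {
        try { await work(item); }
        finally { gate.Release(); }                // free the slot even if the work failed
    }

    // Async counterpart of Fetch2; no thread is blocked while waiting for the server.
    public static async Task FetchAsync(Uri url)
    {
        HttpWebResponse response;
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.AllowAutoRedirect = false;
            response = (HttpWebResponse)await request.GetResponseAsync();
        }
        catch (WebException ex)
        {
            response = ex.Response as HttpWebResponse;
        }

        if (response == null) return;

        using (response)
        {
            // Process response.....
        }
    }
}
```

It could be called as `ThrottledFetch.ForEachAsync(urls, 25, ThrottledFetch.FetchAsync).Wait();`, where `urls` is a hypothetical `IEnumerable<Uri>` and 25 is an arbitrary limit chosen to stay below the configured `maxconnection` value.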

Interesting observations are:

  • My network connection is not under heavy load, so it is not congested.
  • CPU, memory and IO are not really interesting either, but IO shows dips.
  • It's quite possible that the server you're accessing is throttling your connections. What happens if you change your max connection configuration to a smaller number? What happens if you increase it? Do you know how many active connections you *actually* have at any one time? – Jim Mischel Jan 29 '13 at 17:42
  • You might also be bottlenecked on domain name resolution. Have you tried either caching the domain name resolutions or using IP addresses directly? Is your host OS limiting the number of concurrent connections in some way? Your firewall/router might also be doing this. How many threads are actually running when you max out? You should consider making your code use async to minimize the number of stalled threads. – Ade Miller Nov 03 '13 at 22:25