6

I'm developing a .NET 4-based application that has to query third-party servers in order to get information from them. I'm using HttpClient to make these HTTP requests.

I have to make a hundred or a thousand requests in a short period of time. I would like to throttle the creation of these requests to some limit (defined by a constant or similar) so the third-party servers don't receive too many requests at once.

I've checked out this link, which shows how to limit the number of tasks running at any one time.

Here is my non-working approach:

// create the factory
var factory = new TaskFactory(new LimitedConcurrencyLevelTaskScheduler(level));

// use the factory to create a new task that will create the request to the third-party server
var task = factory.StartNew(() => {
    return new HttpClient().GetAsync(url);
}).Unwrap();

Of course, the problem here is that even though only one task at a time is created, many requests will still be created and processed at the same time, because the actual HTTP work runs on another scheduler. I could not find a way to make HttpClient use a particular scheduler.

How should I handle this situation? I would like to limit the number of requests created to a certain limit, but without blocking while waiting for these requests to finish.

Is this possible? Any ideas?

Mauro Ciancio
  • How are you calling the code you posted? Do you have a collection of URLs that you're using in a `foreach` loop, or something like that? – svick Nov 29 '12 at 08:35
  • Exactly, I have a collection of URLs and I convert them into a collection of Tasks. Each mapping is performed using the code posted above. – Mauro Ciancio Nov 29 '12 at 15:25

4 Answers

1

If you can use .NET 4.5, one way would be to use TransformBlock from TPL Dataflow and set its MaxDegreeOfParallelism. Something like:

var block = new TransformBlock<string, byte[]>(
    url => new HttpClient().GetByteArrayAsync(url),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = level });

foreach (var url in urls)
    block.Post(url);

block.Complete();

var result = new List<byte[]>();

while (await block.OutputAvailableAsync())
    result.Add(block.Receive());

There is also another way of looking at this, through ServicePointManager. Using that class, you can set limits on MaxServicePoints (how many servers you can be connected to at once) and DefaultConnectionLimit (how many connections there can be to each server). This way, you could start all your Tasks at the same moment, but only a limited number of them would actually do something. Although limiting the number of Tasks (e.g. by using TPL Dataflow, as I suggested above) will most likely be more efficient.
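A minimal sketch of the ServicePointManager approach; the limit values below are illustrative, not recommendations:

```csharp
using System.Net;

class Program
{
    static void Main()
    {
        const int level = 4; // hypothetical per-server connection limit

        // At most `level` simultaneous connections to any single server.
        ServicePointManager.DefaultConnectionLimit = level;

        // At most 20 distinct servers connected to at once (0 = unlimited).
        ServicePointManager.MaxServicePoints = 20;

        // Any HttpClient/HttpWebRequest created after this point respects
        // these limits; requests beyond the limit queue up inside the
        // ServicePoint instead of opening new connections.
    }
}
```

These are process-wide settings, so they throttle every outgoing connection, not just the ones from your crawling code.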

svick
0

First, you should consider partitioning the workload by website, or at least exposing an abstraction that lets you choose how to partition the list of URLs. For example, one strategy could be to partition by second-level domain (e.g. yahoo.com, google.com).

The other thing is that if you are doing serious crawling, you may want to consider doing it in the cloud instead. That way each node in the cloud can crawl a different partition. When you say "short period of time", you are already setting yourself up for failure: you need hard numbers on what you want to attain.

The other key benefit of partitioning well is that you can avoid hitting servers during their peak hours and risking IP bans at the router level, in case the site doesn't simply throttle you.

John Zabroski
0

You might consider launching a fixed set of threads. Each thread performs its client network operations serially, perhaps also pausing at certain points in order to throttle. This gives you specific control over the load: you can change your throttling policies and the number of threads.
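A minimal sketch of this idea, using a shared BlockingCollection as the work queue; the class name, the fetch delegate, and the pause length are all placeholders, not part of the answer:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class ThrottledDownloader
{
    // Runs `fetch` over every URL using exactly `threadCount` worker
    // threads; the thread count caps how many requests run at once.
    public static void Run(IEnumerable<string> urls, int threadCount,
                           Action<string> fetch)
    {
        var queue = new BlockingCollection<string>();
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            var t = new Thread(() =>
            {
                // Each worker drains the queue serially.
                foreach (var url in queue.GetConsumingEnumerable())
                {
                    fetch(url);
                    Thread.Sleep(100); // optional extra pause to throttle
                }
            });
            t.Start();
            threads.Add(t);
        }

        foreach (var url in urls)
            queue.Add(url);
        queue.CompleteAdding();

        threads.ForEach(t => t.Join());
    }
}
```

Here `fetch` would wrap the actual HttpClient call (performed synchronously inside the worker thread), which is what gives up the async style the question asks for, but makes the concurrency limit trivial to enforce.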

seand
  • Yes, I know that spawning my own thread pool is the way to go. But I was looking for a solution that involved the async .NET framework. Is this clear? – Mauro Ciancio Nov 29 '12 at 15:01
  • You can keep a counter of async requests. Increment when adding an async net operation and decrement from the completion handler (or whatever it's called, I'm a little rusty on this). To throttle, you have to somehow defer new async requests when your counter exceeds n. You might have a single background thread just for this purpose. – seand Nov 30 '12 at 02:47
  • Good point. Is there any way to get notified when a http request is completed? I mean, I can easily increment this counter whenever I create a new HttpClient, but... when do I decrement this counter? I haven't found a hook in HttpClient that will be invoked when the request is done. – Mauro Ciancio Nov 30 '12 at 17:06
0

You might consider creating a new DelegatingHandler to sit in the request/response pipeline of the HttpClient; it could keep count of the number of pending requests.

Generally a single HttpClient instance is used to process multiple requests. Unlike HttpWebRequest, disposing an HttpClient instance closes the underlying TCP/IP connection, so if you want to reuse connections you really need to reuse HttpClient instances.
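A minimal sketch of such a handler; the class name and counter property are hypothetical, and this assumes the System.Net.Http types available to .NET 4 via the Web API client libraries (so ContinueWith rather than await):

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class CountingHandler : DelegatingHandler
{
    private int _pending;

    // Number of requests that have been sent but not yet completed.
    public int PendingRequests { get { return _pending; } }

    public CountingHandler(HttpMessageHandler inner) : base(inner) { }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Interlocked.Increment(ref _pending);
        return base.SendAsync(request, cancellationToken)
                   .ContinueWith(t =>
                   {
                       // Runs whether the request succeeded or faulted,
                       // so the counter always comes back down.
                       Interlocked.Decrement(ref _pending);
                       return t.Result;
                   });
    }
}
```

It would be wired up as `new HttpClient(new CountingHandler(new HttpClientHandler()))`; the continuation is the hook that tells you when a request has finished.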

Darrel Miller
  • Not quite right... (11 years later!) Keep in mind that `DelegatingHandler`s are pooled and disposed by `DefaultHttpClientFactory` unless you use `SetHandlerLifetime(Timeout.InfiniteTimeSpan)`... And it's the http message handlers that own the network connections, not `HttpClient` (which can be reused all day long). – Christian Davén Mar 07 '23 at 13:31
  • Yes, when they implemented the DefaultHttpClientFactory they chose to pool the pipeline instead of just keeping the HttpClient instance around. However, for quite a while, you could only use the DefaultHttpClientFactory if you were building an ASP.NET Web App. Many of us were/are not. IMO it was a bad decision to create HttpClientFactory, because HttpClient was intended to be reused and now DefaultHttpClientFactory creates a new instance for every call, making properties like DefaultRequestHeaders pretty much useless. – Darrel Miller May 12 '23 at 12:50