
I have a problem with determining how to detect completion within a looping TPL Dataflow.

I have a feedback loop in part of a dataflow which is making GET requests to a remote server and processing data responses (transforming these with more dataflow, then committing the results).

The data source splits its results into pages of 1000 records, and won't tell me how many pages it has available for me. I have to just keep reading until I get less than a full page of data.

Usually the number of pages is 1, frequently it is up to 10, every now and again we have 1000s.

I have many requests to fetch at the start. I want to use a pool of threads to deal with this; that much is fine, as I can queue multiple requests for data and fetch them concurrently. If I stumble across an instance where I need to get a big number of pages, I want to be using all of my threads for it. I don't want to be left with one thread churning away whilst the others have finished.

The issue I have is when I drop this logic into dataflow, such as:

//generate initial requests for activity
var request = new TransformManyBlock<int, DataRequest>(cmp => QueueRequests(cmp));

//fetch the initial requests and feedback more requests to our input buffer if we need to
TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    var resp = await Fetch(req);

    if (resp.Results.Count == 1000)
        await fetch.SendAsync(QueueAnotherRequest(req));

    return resp;
}
, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

//commit each type of request
var commit = new ActionBlock<DataResponse>(async resp => await Commit(resp));

request.LinkTo(fetch);
fetch.LinkTo(commit);

//when are we complete?

QueueRequests produces an IEnumerable<DataRequest>. I queue the next N page requests at once, accepting that this means I send slightly more calls than I need to. DataRequest instances share a LastPage counter to avoid needlessly making requests that we know are after the last page. All this is fine.
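As a sketch of that shared counter (the `PageTracker` type and its member names below are my own illustration, not the question's actual code), a lowest-known-last-page marker might look like:

```csharp
using System.Threading;

// Illustrative sketch only: a marker shared by all DataRequests of one
// series, recording the lowest page known to be the final one.
class PageTracker
{
    private int _lastPage = int.MaxValue;

    // Called when a page comes back with fewer than 1000 records.
    public void MarkLastPage(int page)
    {
        int current;
        while (page < (current = Volatile.Read(ref _lastPage)))
        {
            // Only ever move the marker down; another thread may have
            // already found an earlier end of the data.
            if (Interlocked.CompareExchange(ref _lastPage, page, current) == current)
                break;
        }
    }

    // Speculatively queued requests past the known end can be skipped.
    public bool IsPastEnd(int page) => page > Volatile.Read(ref _lastPage);
}
```

Each DataRequest in a series would carry a reference to its tracker and check `IsPastEnd` before issuing the GET.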

The problem:
If I loop by feeding back more requests into fetch's input buffer, as I've shown in this example, then I have a problem with how to signal (or even detect) completion. I can't set completion on fetch from request, because once completion is set I can't feed back any more.

I can monitor for the input and output buffers being empty on fetch, but I think I'd be risking fetch still being busy with a request when I set completion, thus preventing queuing requests for additional pages.

I could do with some way of knowing that fetch is busy (either has input or is busy processing an input).

Am I missing an obvious/straightforward way to solve this?

  • I could loop within fetch, rather than queuing more requests. The problem with that is I want to be able to use a set maximum number of threads to throttle what I'm doing to the remote server. Could a parallel loop inside the block share a scheduler with the block itself and the resulting thread count be controlled via the scheduler?

  • I could create a custom transform block for fetch to handle the completion signalling. Seems like a lot of work for such a simple scenario.
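On the first bullet: no scheduler trick is needed for an inner loop to respect the throttle, because the block's own MaxDegreeOfParallelism already caps concurrent invocations. A rough sketch, assuming the question's Fetch and QueueAnotherRequest helpers, would page through one series inside a single invocation:

```csharp
// Sketch: loop over pages inside the block instead of feeding requests back.
// With no feedback edge, PropagateCompletion works naturally end to end.
var fetch = new TransformManyBlock<DataRequest, DataResponse>(async req =>
{
    var responses = new List<DataResponse>();
    DataResponse resp;
    do
    {
        resp = await Fetch(req);             // the question's own fetch call
        responses.Add(resp);
        req = QueueAnotherRequest(req);      // next page of the same series
    }
    while (resp.Results.Count == 1000);      // a full page implies more data

    return responses;
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });
```

The trade-off is exactly the one the question raises: a series with thousands of pages is read serially by one invocation while the other nine slots may sit idle, so this shape only suits workloads where long series are rare.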

Many thanks for any help offered!

VMAtm
ajk
  • Do you know the moment when all requests are generated in the first block? – VMAtm Nov 28 '16 at 02:46
  • yeah, to start the pipeline, I call `foreach (var c in todolist) { request.Post(c); }`. Then I can call `request.Complete();` as I won't add any more requests. – ajk Nov 28 '16 at 06:19
  • @ajk, if that's what you're doing, why don't you simply use `a.LinkTo(b, new DataflowLinkOptions { PropagateCompletion = true })` on all your block links? Then calling `request.Complete()` will cause `commit.Completion` to transition to completed state once all items have passed through all stages of your pipeline, naturally. – Kirill Shlenskiy Nov 28 '16 at 09:03
  • @KirillShlenskiy Yeah, that would be nice, but after fetch is in the completed state it won't accept any more messages, which is what fetch itself is producing. So the line `await fetch.SendAsync` doesn't succeed. – ajk Nov 28 '16 at 10:56
  • Related: [How to mark a TPL dataflow cycle to complete?](https://stackoverflow.com/questions/26130168/how-to-mark-a-tpl-dataflow-cycle-to-complete) – Theodor Zoulias Jun 25 '20 at 21:02

2 Answers


In TPL Dataflow, you can link the blocks with DataflowLinkOptions, specifying that completion should propagate between them:

request.LinkTo(fetch, new DataflowLinkOptions { PropagateCompletion = true });
fetch.LinkTo(commit, new DataflowLinkOptions { PropagateCompletion = true });

After that, you simply call the Complete() method on the request block, and you're done!

// the completion will be propagated to all the blocks
request.Complete();

Finally, use the Completion task property of the last block:

commit.Completion.ContinueWith(t =>
    {
        /* check the status of the task and correctness of the requests handling */
    });
VMAtm
  • Hi @VMAtm, yes as discussed in the comments above, this is understood. However once completion has been propagated to Fetch, Fetch can no longer post more messages to its input buffer. Fetch feeds back messages to itself if, when it gets response, it discovers that more data is available. When completion is set on fetch this feedback method is no longer allowed. – ajk Nov 28 '16 at 15:40
  • Well then you simply propagate the completion from `fetch` to `commit` and use the `request.Completion.ContinueWith` loop for checking the `fetch` state, as you do in your answer – VMAtm Nov 28 '16 at 15:49
  • many thanks. I was unsure if there was a better way to know fetch is complete, but I can live with this if not! – ajk Nov 28 '16 at 16:00

For now I have added a simple busy-state counter to the fetch block:

int fetch_busy = 0;

TransformBlock<DataRequest, DataResponse> fetch = null;
fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
    {
        // Count this invocation as busy before doing any work.
        Interlocked.Increment(ref fetch_busy);
        try
        {
            var resp = await Fetch(req);

            if (resp.Results.Count == 1000)
            {
                await fetch.SendAsync(QueueAnotherRequest(req));
            }

            return resp;
        }
        finally
        {
            // Decrement on both the success and failure paths.
            Interlocked.Decrement(ref fetch_busy);
        }
    }
    , new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

Which I then use to signal completion as follows:

request.Completion.ContinueWith(async _ =>
    {
        while ( fetch.InputCount > 0 || fetch_busy > 0 )
        {
            await Task.Delay(100);
        }

        fetch.Complete();
    });

This doesn't seem very elegant, but I think it should work.
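A variant sketch that avoids both the polling loop and any window where the counter can read zero while work is still in flight: count each item as pending before it enters the block, and let whichever decrement drains the count to zero call Complete(). The SendTrackedAsync helper, seeding loop, and initialRequests name below are illustrative, and in this layout the initial requests are posted through the helper rather than linked in from request, so every item is counted:

```csharp
int pending = 0;  // items handed to fetch but not yet fully processed

TransformBlock<DataRequest, DataResponse> fetch = null;

// Count BEFORE the item enters the block, so 'pending' can never be
// observed at zero while a follow-up page is still being posted.
async Task SendTrackedAsync(DataRequest req)
{
    Interlocked.Increment(ref pending);
    await fetch.SendAsync(req);
}

fetch = new TransformBlock<DataRequest, DataResponse>(async req =>
{
    try
    {
        var resp = await Fetch(req);
        if (resp.Results.Count == 1000)
            await SendTrackedAsync(QueueAnotherRequest(req));  // child counted before parent finishes
        return resp;
    }
    finally
    {
        // The parent decrements only after any child was counted, so zero
        // genuinely means no work is queued or in flight anywhere.
        if (Interlocked.Decrement(ref pending) == 0)
            fetch.Complete();
    }
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10 });

// Seeding: hold the count open so an early fast series can't drain it
// to zero before all initial requests have been posted.
Interlocked.Increment(ref pending);
foreach (var req in initialRequests)   // stands in for QueueRequests(...)
    await SendTrackedAsync(req);
if (Interlocked.Decrement(ref pending) == 0)
    fetch.Complete();
```

Because a follow-up page is always counted before its parent's decrement, the zero check never fires while a feedback SendAsync is pending, which sidesteps the race discussed in the comments below.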

ajk
  • My understanding is that `SendAsync` returns as soon as it validates that the next block is happy to accept the new item. It is therefore possible for `await fetch.SendAsync` to finish (followed by decrementing `fetch_busy`) before the "inner" transformation has had the chance to increment `fetch_busy` again. During that time it's possible for the `fetch` block to be marked as complete by your continuation (if `fetch_busy` and `fetch.InputCount` both happen to be zero). If the in-flight inner `Fetch` task then produces 1000 items and tries another `SendAsync`, it will fail quietly. – Kirill Shlenskiy Nov 28 '16 at 14:02
  • This is obviously a pretty far-fetched but not inconceivable scenario, so perhaps you should throw if `await fetch.SendAsync` returns `false`. Also remember: `ContinueWith` with an async lambda as an argument returns a `Task` (this could lead to surprises if you ever decide to do anything with the result). – Kirill Shlenskiy Nov 28 '16 at 14:02
  • @KirillShlenskiy thanks for this, yes I'll investigate. I'd tried to mitigate that with the check for `fetch.InputCount > 0` within the `ContinueWith`. Are you saying that `await fetch.SendAsync` can return before the newly queued request shows up in `fetch.InputCount`? – ajk Nov 28 '16 at 15:54
  • No, that would be unlikely. However, the `fetch.InputCount`/`fetch_busy` check can occur right after the inner fetch grabs the item to process (thereby decreasing `InputCount`, potentially to zero), but before it has had the chance to do any of its work (i.e. increment `fetch_busy`). – Kirill Shlenskiy Nov 28 '16 at 21:28