
I have a Dataflow pipeline consisting of several blocks. As elements flow through the pipeline, I want to group them by field A. To do this I use a BatchBlock with a high BoundedCapacity: elements are stored in it until I decide they should be released, at which point I invoke the TriggerBatch() method.

private void Forward(TStronglyTyped data)
{
    if (ShouldCreateNewGroup(data))
    {
        GroupingBlock.TriggerBatch();
    }

    GroupingBlock.SendAsync(data).Wait(SendTimeout);
}

This is how it looks. The problem is that the produced batch sometimes contains the next posted element, which shouldn't be there.

To illustrate:

BatchBlock.InputQueue = {A,A,A}
NextElement = B //we should trigger a Batch!
BatchBlock.TriggerBatch()
BatchBlock.SendAsync(B);

At this point I expect my batch to be {A,A,A}, but it is {A,A,A,B}.

It is as if TriggerBatch() were asynchronous and SendAsync had in fact executed before the batch was actually made.

How can I solve this? I obviously don't want to put Task.Wait(x) in there (I tried, and it works, but then performance is poor, of course).

Theodor Zoulias
wojciech_rak
  • You don't explain how you call `Forward` but almost certainly the issue is that another message was posted to it between the call to `ShouldCreate` and `TriggerBatch`. There's nothing wrong with it, just the way it's supposed to work. You *shouldn't* be trying to trigger the BatchBlock from the outside. The only way to avoid such issues is to trigger it from the inside. Create a custom block with DataflowBlock.Encapsulate that exposes an ActionBlock as input and BatchBlock or BufferBlock as output. In the ActionBlock, check the input and either add the message or trigger the batch – Panagiotis Kanavos Mar 09 '16 at 15:27
  • Check [this example](https://msdn.microsoft.com/en-us/library/hh228606(v=vs.110).aspx) that creates a SlidingWindow block using Encapsulate using an ActionBlock, a Queue for storage and a BufferBlock for output – Panagiotis Kanavos Mar 09 '16 at 15:29
  • `Forward` is called from ActionBlock that is preceding BatchBlock. I have disabled Parallelism, so each block should process only 1 message at a time, right? – wojciech_rak Mar 09 '16 at 16:51
  • Who is posting to the BatchBlock though? It *can't* be linked to an ActionBlock, so where does it get its data from? In any case, you don't need a BatchBlock, you can use a simple Queue, List etc and simply post an array of all cached objects when appropriate. This is what the SlidingWindow example does. – Panagiotis Kanavos Mar 09 '16 at 16:52
  • You are right. I have slightly modified example of SlidingWindow. In `ActionBlock` part, I am checking if current data should be pushed outside. Now everything works as I wanted. Thanks! – wojciech_rak Mar 09 '16 at 17:17

2 Answers


I also encountered this issue by trying to call TriggerBatch in the wrong place. As mentioned, the SlidingWindow example using DataflowBlock.Encapsulate is the answer here, but it took some time to adapt so I thought I'd share my completed block.

My ConditionalBatchBlock creates batches up to a maximum size, possibly sooner if a certain condition is met. In my specific scenario I needed to create batches of 100, but always create new batches when certain changes in the data were detected.

public static IPropagatorBlock<T, T[]> CreateConditionalBatchBlock<T>(int batchSize, Func<Queue<T>, T, bool> condition)
{
    var queue = new Queue<T>();

    var source = new BufferBlock<T[]>();

    var target = new ActionBlock<T>(async item =>
    {
        // start a new batch if required by the condition
        if (condition(queue, item))
        {
            await source.SendAsync(queue.ToArray());
            queue.Clear();
        }

        queue.Enqueue(item);

        // always send a batch when the max size has been reached
        if (queue.Count == batchSize)
        {
            await source.SendAsync(queue.ToArray());
            queue.Clear();
        }
    });

    // send any remaining items
    target.Completion.ContinueWith(async t =>
    {
        if (queue.Any())
            await source.SendAsync(queue.ToArray());

        source.Complete();
    });

    return DataflowBlock.Encapsulate(target, source);
}

The condition parameter may be simpler in your case. I needed to look at the queue as well as the current item to decide whether to create a new batch.
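For the grouping-by-field case from the question, the condition could be as small as comparing the incoming item's group with that of the items already queued. The following is only a sketch: the `Item` record, its `Group` property and the `BatchConditions` class are hypothetical names invented for illustration, not part of the original code.

```csharp
using System.Collections.Generic;

// Hypothetical item type, used only for illustration.
public sealed record Item(string Group, int Value);

public static class BatchConditions
{
    // True when the incoming item belongs to a different group than the
    // items already queued. The Count check skips the very first item,
    // so no empty batch is emitted for it.
    public static bool StartsNewGroup(Queue<Item> queue, Item item) =>
        queue.Count > 0 && queue.Peek().Group != item.Group;
}
```

Under these assumptions it could be passed as `CreateConditionalBatchBlock<Item>(100, BatchConditions.StartsNewGroup)`. Peeking at the oldest queued item is enough, because every item in the queue belongs to the same group by construction.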

I used it like this:

public async Task RunExampleAsync<T>()
{
    var conditionalBatchBlock = CreateConditionalBatchBlock<T>(100, (queue, currentItem) => ShouldCreateNewBatch(queue, currentItem));

    var actionBlock = new ActionBlock<T[]>(async x => await PerformActionAsync(x));

    conditionalBatchBlock.LinkTo(actionBlock, new DataflowLinkOptions { PropagateCompletion = true });

    await ReadDataAsync<T>(conditionalBatchBlock);

    await actionBlock.Completion;
}
Loren Paulsen

Here is a specialized version of Loren Paulsen's CreateConditionalBatchBlock method. This one accepts a Func<TItem, TKey> keySelector argument, and emits a new batch every time an item with a different key is received.

public static IPropagatorBlock<TItem, TItem[]> CreateConditionalBatchBlock<TItem, TKey>(
    Func<TItem, TKey> keySelector,
    DataflowBlockOptions dataflowBlockOptions = null,
    int maxBatchSize = DataflowBlockOptions.Unbounded,
    IEqualityComparer<TKey> keyComparer = null)
{
    if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));
    if (maxBatchSize < 1 && maxBatchSize != DataflowBlockOptions.Unbounded)
        throw new ArgumentOutOfRangeException(nameof(maxBatchSize));

    keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;
    var options = new ExecutionDataflowBlockOptions();
    if (dataflowBlockOptions != null)
    {
        options.BoundedCapacity = dataflowBlockOptions.BoundedCapacity;
        options.CancellationToken = dataflowBlockOptions.CancellationToken;
        options.MaxMessagesPerTask = dataflowBlockOptions.MaxMessagesPerTask;
        options.TaskScheduler = dataflowBlockOptions.TaskScheduler;
    }

    var output = new BufferBlock<TItem[]>(options);

    var queue = new Queue<TItem>(); // Synchronization is not needed
    TKey previousKey = default;

    var input = new ActionBlock<TItem>(async item =>
    {
        var key = keySelector(item);
        if (queue.Count > 0 && !keyComparer.Equals(key, previousKey))
        {
            await output.SendAsync(queue.ToArray()).ConfigureAwait(false);
            queue.Clear();
        }
        queue.Enqueue(item);
        previousKey = key;

        if (queue.Count == maxBatchSize)
        {
            await output.SendAsync(queue.ToArray()).ConfigureAwait(false);
            queue.Clear();
        }
    }, options);

    _ = input.Completion.ContinueWith(async t =>
    {
        if (queue.Count > 0)
        {
            await output.SendAsync(queue.ToArray()).ConfigureAwait(false);
            queue.Clear();
        }
        if (t.IsFaulted)
        {
            ((IDataflowBlock)output).Fault(t.Exception.InnerException);
        }
        else
        {
            output.Complete();
        }
    }, TaskScheduler.Default);

    return DataflowBlock.Encapsulate(input, output);
}
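To show the expected behavior on the question's {A,A,A},B scenario, here is a small self-contained sketch. The `KeyedBatchDemo` class and its `BatchByKey` helper are hypothetical: `BatchByKey` is a condensed stand-in for the method above (no options, no maximum batch size, no fault propagation), included only so that the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class KeyedBatchDemo
{
    // Condensed, hypothetical stand-in for CreateConditionalBatchBlock.
    public static IPropagatorBlock<T, T[]> BatchByKey<T, TKey>(Func<T, TKey> keySelector)
    {
        var comparer = EqualityComparer<TKey>.Default;
        var output = new BufferBlock<T[]>();
        var queue = new List<T>(); // touched by one message at a time only
        TKey previousKey = default;

        var input = new ActionBlock<T>(async item =>
        {
            var key = keySelector(item);
            // A key change closes the current batch before the new item is stored.
            if (queue.Count > 0 && !comparer.Equals(key, previousKey))
            {
                await output.SendAsync(queue.ToArray());
                queue.Clear();
            }
            queue.Add(item);
            previousKey = key;
        });

        // Flush whatever is left once the input side has completed.
        _ = input.Completion.ContinueWith(async _ =>
        {
            if (queue.Count > 0) await output.SendAsync(queue.ToArray());
            output.Complete();
        }, TaskScheduler.Default);

        return DataflowBlock.Encapsulate(input, output);
    }

    public static async Task<List<string>> RunAsync()
    {
        // Items are keyed by their first character.
        var block = BatchByKey<string, char>(s => s[0]);
        foreach (var s in new[] { "A1", "A2", "A3", "B1", "B2" })
            await block.SendAsync(s);
        block.Complete();

        var batches = new List<string>();
        while (await block.OutputAvailableAsync())
            batches.Add(string.Join(",", block.Receive()));
        return batches; // "A1,A2,A3" followed by "B1,B2"
    }
}
```

Because the key check happens inside the block, right before the item is stored, there is no window in which the next element can slip into the previous batch.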
Theodor Zoulias
  • A "heavier" custom `BatchBlock` implementation can be found [here](https://stackoverflow.com/questions/32717337/data-propagation-in-tpl-dataflow-pipeline-with-batchblock-triggerbatch/62609868#62609868). – Theodor Zoulias Jun 27 '20 at 12:20
  • Regarding the "Synchronization is not needed" comment: is the ActionBlock intentionally created with MaxDegreeOfParallelism = 1? Is that why the synchronization is not needed? – ben92 Aug 24 '21 at 05:06
  • @ben92 yeap, exactly. – Theodor Zoulias Aug 24 '21 at 06:50