
I have two TransformBlocks arranged in a loop, linked to each other. TransformBlock 1 is an I/O block that reads data plus some metadata; it is limited to a maximum of 50 concurrent tasks. The results are passed to the second block, which decides, based on the metadata, whether the message goes back to the first block. So when the metadata matches the criteria, after a short wait the data should be posted back to the I/O block. The second block's MaxDegreeOfParallelism can be unbounded.

Now I have noticed that when I post a lot of data to the I/O block, it takes a long time until the messages reach the second block. It takes something like 10 minutes, and then they all arrive in one bunch, like 1000 entries within a few seconds. Normally I would implement it like this:

public void Start()
{
    _ioBlock = new TransformBlock<Data, Tuple<Data, MetaData>>(async data =>
    {
        var metaData = await ReadAsync(data).ConfigureAwait(false);

        return new Tuple<Data, MetaData>(data, metaData);

    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

    _waitBlock = new TransformBlock<Tuple<Data, MetaData>, Data>(async dataMetaData =>
    {
        var data = dataMetaData.Item1;
        var metaData = dataMetaData.Item2;

        if (!metaData.Repost)
        {
            return null;
        }

        await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);

        return data;

    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

    _ioBlock.LinkTo(_waitBlock);
    _waitBlock.LinkTo(_ioBlock, data => data != null);
    _waitBlock.LinkTo(DataflowBlock.NullTarget<Data>());

    foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
    {
        _ioBlock.Post(data);
    }
}

But because of the described problem, I had to implement it like this:

public void Start()
{
    _ioBlock = new ActionBlock<Data>(async data =>
    {
        var metaData = await ReadAsync(data).ConfigureAwait(false);

        var dataMetaData = new Tuple<Data, MetaData>(data, metaData);

        _waitBlock.Post(dataMetaData);

    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

    _waitBlock = new ActionBlock<Tuple<Data, MetaData>>(async dataMetaData =>
    {
        var data = dataMetaData.Item1;
        var metaData = dataMetaData.Item2;

        if (metaData.Repost)
        {
            await Task.Delay(TimeSpan.FromMinutes(1)).ConfigureAwait(false);

            _ioBlock.Post(data);
        }
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

    foreach (var data in Enumerable.Range(0, 2000).Select(i => new Data(i)))
    {
        _ioBlock.Post(data);
    }
}

When I use the second approach the data gets linked/posted faster (one by one), but it feels more like a hack to me. Does anybody know how to fix the problem? Some friends recommended TPL Pipeline to me, but it seems much more complicated.

BlackMatrix
  • What's the deal with the **one minute** wait in your code?! –  Mar 20 '18 at 23:29
  • Sometimes blocks will wait until all information has flowed from one block to another. You can tell a block to act on data immediately. Try setting **BoundedCapacity** to **5** or similar. https://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html –  Mar 20 '18 at 23:35
  • The deal with the **one minute** wait is to retry it after a while. It's just a demo project. I need the data out of the first TransformBlock because this block should always be busy. BoundedCapacity would limit the input/output buffers and I would have to use SendAsync and probably an additional task, wouldn't I? – BlackMatrix Mar 20 '18 at 23:42
  • 1
    Try it. I suspect it is batching and you've introduced a possible 1 minute delay per batch –  Mar 21 '18 at 01:58
  • 1
    _"..The second block decides on the meta data if the message goes again to the first block.."_ - your first example is doing that? Still don't understand why you are `sleeping`. It defeats the whole purpose of threading/concurrency/TPL Dataflow particularly when your problem is _"TPL Dataflow LinkTo TransformBlock is `very slow`"_. –  Mar 21 '18 at 02:15
  • The cycle is like this: Block1 -> Block2 -> Block1 -> Block2 ... I have added log entries to see when the data enters and exits the blocks. At the start, the input buffer of Block1 is filled and 50 items are processed concurrently. Everything is fine. As soon as the items are processed in Block1 they should be transferred to Block2, but this is not efficient. I get like 1000 log entries within a few seconds when the items enter Block2, even though they finished Block1 over a timespan of 5-10 minutes. As I said, my second approach is working fine. Maybe you see the difference? – BlackMatrix Mar 21 '18 at 10:06
  • 1
    `LinkTo` doesn't do anything at runtime. It doesn't *pass* anything to the next block, it sets up a connection so that one block will post results to the next. If something is slow, it's the code in the blocks – Panagiotis Kanavos Mar 21 '18 at 13:23
  • 1
    @BlackMatrix that `Task.Delay(TimeSpan.FromMinutes(1))` is an obvious reason for slow performance and reason enough to close this question as not reproducible. The two snippets are completely different - the second one has no delays – Panagiotis Kanavos Mar 21 '18 at 13:25
  • 1
    @BlackMatrix as for "deciding" that's not the job of the *blocks*. `LinkTo` accepts a `Predicate` argument that decides whether a message should be passed to the linked block or not. You can have multiple `LinkTo` calls linking one block to many other with conditions. Just make sure the conditions allow *all* messages to be routed in the end, otherwise you'll end up with messages stuck in the output buffer – Panagiotis Kanavos Mar 21 '18 at 13:28
  • 1
    @BlackMatrix finally, blocks aren't *queues*. Having cycles in the mesh makes completion very tricky - is `_ioBlock` going to call complete on itself? What if you tell it to complete and it produces *more* messages to retry? It's better to retry inside the block itself up to eg 3 times. – Panagiotis Kanavos Mar 21 '18 at 13:31
  • The blocks are completely different? Both are waiting a timespan until they get posted again to the I/O block. Making a retry inside of the I/O block doesn't make any sense because the retry wouldn't help either. Don't get stuck on the **one minute**; in reality the timespan is 15-60 minutes. Also, if I did the retry inside of the I/O block, the items in the InputBuffer wouldn't be processed, because only 50 at a time are processed in that block. So I need to get the data out of that block. The blocks never end. Some extra data items are posted from outside to the I/O block. – BlackMatrix Mar 21 '18 at 14:03
  • Maybe I'm getting something wrong, but when does the data go from one block to the next when linked? Do they wait until all 50 data items are finished and are then passed to Block2? Look at this: https://pastebin.com/i3QLSc6Z I posted 10000 elements to the I/O block; look at the "Enter wait block.." bunch at 2018-03-21 15:37:56.1256. I used the following code for testing: https://pastebin.com/zRYYWDGx – BlackMatrix Mar 21 '18 at 14:43
  • exit i/o block 29 until enter wait block 29 takes almost 1 minute, even though there's no real CPU load (just waiting). The wait block has unlimited capacity, so this should execute faster (like it does with my second approach, using `_ioBlock.Post` inside of `_waitBlock`) – BlackMatrix Mar 21 '18 at 15:14

1 Answer


Problem solved. You need to set

ExecutionDataflowBlockOptions.EnsureOrdered

to false so that the data is forwarded immediately to the next/wait block instead of being held back to preserve input order.
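`EnsureOrdered` defaults to `true`, so a `TransformBlock` holds a finished result back until all earlier messages have also finished, which is what batches the output when individual reads take minutes. A minimal sketch of the fix applied to the I/O block from the question (`Data`, `MetaData`, and `ReadAsync` are the types and method from the original code):

```csharp
_ioBlock = new TransformBlock<Data, Tuple<Data, MetaData>>(async data =>
{
    var metaData = await ReadAsync(data).ConfigureAwait(false);
    return new Tuple<Data, MetaData>(data, metaData);
}, new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 50,
    // Default is true: results wait for all earlier messages to finish.
    // With false, each result flows to the linked block as soon as its
    // own read completes, at the cost of losing FIFO ordering.
    EnsureOrdered = false
});
```

The same option can be set on the wait block; since the mesh routes on metadata rather than position, losing the ordering guarantee should be harmless here.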

Further information:

Why do blocks run in this order?

BlackMatrix