
Scope:

  • I want to process a large file (1 GB+) by splitting it into smaller, manageable chunks (partitions), persisting them on some storage infrastructure (local disk, blob, network, etc.) and processing them one by one, in memory.
  • I want to achieve this by leveraging the TPL Dataflow library, so I've created several processing blocks, each of them performing a specific action on an in-memory file partition.
  • Furthermore, I'm using a SemaphoreSlim object to limit the maximum number of in-memory partitions being processed at a given time: a partition holds the semaphore from the moment it is loaded until it is fully processed.
  • I'm also using the MaxDegreeOfParallelism configuration property at block level to limit the degree of parallelism of each block.

From a technical perspective, the goal is to use a semaphore to limit how many partitions are processed in parallel across several consecutive pipeline steps, thus avoiding overloading the memory.

Issue description: When MaxDegreeOfParallelism is set to a value greater than 1 for all Dataflow blocks except the first one, the process hangs and appears to reach a deadlock. When MaxDegreeOfParallelism is set to 1, everything works as expected. Code sample below...

Do you have any idea/hint/tip why this happens?

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

namespace DemoConsole
{
    class Program
    {
        private static readonly SemaphoreSlim _localSemaphore = new(1);

        static async Task Main(string[] args)
        {
            Console.WriteLine("Configuring pipeline...");

            var dataflowLinkOptions = new DataflowLinkOptions() { PropagateCompletion = true };

            var filter1 = new TransformManyBlock<string, PartitionInfo>(CreatePartitionsAsync, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

            // when MaxDegreeOfParallelism on the below line is set to 1, everything works as expected; any value greater than 1 causes issues              
            var blockOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 };

            var filter2 = new TransformBlock<PartitionInfo, PartitionInfo>(ReadPartitionAsync, blockOptions);
            var filter3 = new TransformBlock<PartitionInfo, PartitionInfo>(MapPartitionAsync, blockOptions);
            var filter4 = new TransformBlock<PartitionInfo, PartitionInfo>(ValidatePartitionAsync, blockOptions);

            var actionBlock = new ActionBlock<PartitionInfo>(async (x) => { await Task.CompletedTask; });

            filter1.LinkTo(filter2, dataflowLinkOptions);
            filter2.LinkTo(filter3, dataflowLinkOptions);
            filter3.LinkTo(filter4, dataflowLinkOptions);
            filter4.LinkTo(actionBlock, dataflowLinkOptions);

            await filter1.SendAsync("my-file.csv");

            filter1.Complete();

            await actionBlock.Completion;

            Console.WriteLine("Pipeline completed.");
            Console.ReadKey();
            Console.WriteLine("Done");
        }

        private static async Task<IEnumerable<PartitionInfo>> CreatePartitionsAsync(string input)
        {
            var partitions = new List<PartitionInfo>();
            const int noOfPartitions = 10;

            Log($"Creating {noOfPartitions} partitions from raw file on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");

            for (short i = 1; i <= noOfPartitions; i++)
            {
                partitions.Add(new PartitionInfo { FileName = $"{Path.GetFileNameWithoutExtension(input)}-p{i}-raw.json", Current = i });
            }

            await Task.CompletedTask;

            Log($"Creating {noOfPartitions} partitions from raw file completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");

            return partitions;
        }

        private static async Task<PartitionInfo> ReadPartitionAsync(PartitionInfo input)
        {
            Log($"Semaphore - trying to enter for partition [{input.Current}] - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            await _localSemaphore.WaitAsync();
            Log($"Semaphore - entered for partition [{input.Current}] - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");

            Log($"Reading partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Reading partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");

            return input;
        }

        private static async Task<PartitionInfo> MapPartitionAsync(PartitionInfo input)
        {
            Log($"Mapping partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Mapping partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");

            return input;
        }

        private static async Task<PartitionInfo> ValidatePartitionAsync(PartitionInfo input)
        {
            Log($"Validating partition [{input.Current}] on Thread [{Thread.CurrentThread.ManagedThreadId}] ...");
            await Task.Delay(1000);
            Log($"Validating partition [{input.Current}] completed on Thread [{Thread.CurrentThread.ManagedThreadId}].");

            Log($"Semaphore - releasing - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");
            _localSemaphore.Release();
            Log($"Semaphore - released - Current count is [{_localSemaphore.CurrentCount}]; client thread [{Thread.CurrentThread.ManagedThreadId}]");

            return input;
        }

        private static void Log(string message) => Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} : {message}");
    }

    class PartitionInfo
    {
        public string FileName { get; set; }
        public short Current { get; set; }
    }
}
  • Does this happen also if you remove the initial `TransformManyBlock`, and feed `PartitionInfo`s manually to the `filter2` block? – Theodor Zoulias Jun 14 '21 at 15:44
  • Have you considered using [BoundedCapacity](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.dataflowblockoptions.boundedcapacity?view=net-5.0#System_Threading_Tasks_Dataflow_DataflowBlockOptions_BoundedCapacity) block option? (_Instead_ of SemaphoreSlim) – Fildor Jun 14 '21 at 15:48
  • `await Task.CompletedTask;` - what do think this does? Why not `return Task.FromResult(partitions);`? – Fildor Jun 14 '21 at 15:52
  • @Fildor the `await Task.CompletedTask;` is probably there to suppress the compiler warning CS1998 "This async method lacks 'await' operators and will run synchronously". I don't think that it's relevant to the deadlock problem. – Theodor Zoulias Jun 14 '21 at 15:56
  • @TheodorZoulias That's what I was suspecting. But it's indeed unrelated. – Fildor Jun 14 '21 at 15:58
  • Why use SemaphoreSlim at all? Dataflow already supports throttling and backpressure through the MaxDegreeOfParallelism and BoundedCapacity settings. `I'm using a SemaphoreSlim object to limit to max number of in-memory partitions being processed at a given time` that's what MaxDegreeOfParallelism is for. – Panagiotis Kanavos Jun 14 '21 at 16:01
  • Please explain the *actual* problem, not how you think it can be solved. What kind of partitioning do you want and *why*? To limit the in-memory data all you need is to set the `BoundedCapacity` value. Do you need to process data in fixed-size,ordered batches? That's what' `BatchBlock` does. Do you need to group and aggregate? That's more interesting, and the solution depends on the actual problem – Panagiotis Kanavos Jun 14 '21 at 16:07
  • @PanagiotisKanavos MaxDegreeOfParallelism does not fully help because it can only be set at block level. Let's imagine the following scenario: an initial file having the size 1GB is split in the first TransformManyBlock in 10 partitions of ~100 MB each and persisted on disk. Then the following blocks will process a given partition. Even with MaxDegreeOfParallelism set to 1 for each block, at a given moment in time filter2 will process partition1, filter3 will process partition2, filter4 will process partition3 and so on. – Bogdan Rotaru Jun 14 '21 at 20:05
  • @PanagiotisKanavos The scope is to process partition1 in filter2 and move it to filter3 and so on until it reaches the last filter. Only after that the next partition should start to be processed. – Bogdan Rotaru Jun 14 '21 at 20:07
  • @Fildor Currently, BoundedCapacity for all blocks is set to default (unlimited). I'm not sure if it would help... – Bogdan Rotaru Jun 14 '21 at 20:10
  • @Fildor - The code sample is simplified for the sake of demonstrating the issue. The actual code contains processing logic instead of "await Task.Delay(1000);" or "await Task.CompletedTask" instructions. – Bogdan Rotaru Jun 14 '21 at 20:12
  • Using `SemaphoreSlim` with Dataflow is unusual. To coordinate across multiple blocks, check out `ConcurrentExclusiveTaskScheduler`. – Stephen Cleary Jun 15 '21 at 02:45
  • @BogdanRotaru change BoundedCapacity. If you set eg BoundedCapacity to 1 for every block there would be only 1 item in memory per block at a time. What you're doing right now is caching *everything* in memory, whether you need it or not. – Panagiotis Kanavos Jun 15 '21 at 06:29
  • @BogdanRotaru again, what's the *real* problem? Until now you're asking how to fix the attempted solution, using external means, when the Dataflow itself can easily handle the job. 1MB or 100 GB it doesn't matter, if you only have 1 line per block at a time – Panagiotis Kanavos Jun 15 '21 at 06:30
  • @BogdanRotaru for example, I have to process 50K-100K air tickets every 15 minutes, and request detailed ticket records from airline, parse the responses and insert them to the database. If I used unbounded blocks, there would be 100K requests waiting in the ticket record block waiting the slow download process to complete. The XML responses are *big*, which would waste a lot of RAM. By setting BoundedCapacity to 8 though, I keep only 8 records in flight. By setting MaxDOP to 8, I only make 8 concurrent requests. At the end, a BatchBlock batches results to insert in the database – Panagiotis Kanavos Jun 15 '21 at 06:36
  • @StephenCleary The `ConcurrentExclusiveTaskScheduler` task scheduler provides exclusive access with regards to other tasks on the same pair. It does not provide a mechanism to enter and exit a sequence of tasks and allow the next item to enter only when previous one has left the exclusive zone. – Bogdan Rotaru Jun 15 '21 at 13:44
  • @BogdanRotaru: Sure it does; it's the `Concurrent` part of `ConcurrentExclusiveSchedulerPair`. – Stephen Cleary Jun 15 '21 at 13:46
  • @PanagiotisKanavos Regarding your previous comments "_If you set eg BoundedCapacity to 1 for every block there would be only 1 item in memory per block at a time_" - my requirement is to have an overall bounded capacity set to 1 so that a sequence of dataflow blocks (`filter2`, `filter3` and `filter4`) will process only 1 item at a time. – Bogdan Rotaru Jun 15 '21 at 13:49
  • @PanagiotisKanavos When `filter2` finishes it sends the item to `filter3` but without picking up the next one. Same for `filter3`: when it finishes it will send the item to `filter4`. Only when `filter4` completes should `filter2` pick up the next item. Hope this is clearer now. – Bogdan Rotaru Jun 15 '21 at 13:51
  • Not really. You still describe the attempted solution. If you only want one item to be processed at a time, why use Dataflow at all? That's no different than asking how to create a pipeline of bash commands but force the tools to process only one item at a time. Well, don't use a pipeline in that case. Or execute the pipeline inside a loop. On other hand, what's wrong with having 5 blocks processing one line at a time? No matter how big the file is, 10MB or 10TB, only 5 lines will be in memory at a time. Plus the stream's buffer, which typically is 4KB or 8KB – Panagiotis Kanavos Jun 15 '21 at 13:53
  • What does the *real* code do? What's behind `Task.Delay(1000)`? That's what matters. This isn't about Dataflow quirks or bugs, but what kind of architecture is suitable for your problem. – Panagiotis Kanavos Jun 15 '21 at 14:02

1 Answer


Before implementing this solution, take a look at the comments, because there is a fundamental architecture problem in your code.

However, the issue you've posted is reproducible and can be solved with the following ExecutionDataflowBlockOptions change:

var blockOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5, EnsureOrdered = false };

The EnsureOrdered property defaults to true. When the degree of parallelism is greater than 1, there is no guarantee which message will finish processing first. If the message that finishes first was not the first one the block received, its output waits in a reordering buffer until the earlier messages complete. In your case, that message is still holding the semaphore, which is only released in filter4 — a block its output never reaches while it sits in filter2's reordering buffer — so the earlier messages can never complete, and the pipeline deadlocks. And because filter1 is a TransformManyBlock, I'm not sure it's even possible to know in what order the messages are sent to filter2.

If you run your code enough times, you will eventually get lucky: the first message sent to filter2 also gets processed first, in which case it releases the semaphore and the pipeline progresses. But you will have the same issue on the very next message processed; if it wasn't the next message in line, it will wait in the reordering buffer.
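For reference, here is how the fix slots into the sample from the question — only the `blockOptions` declaration changes; everything else stays as posted (a sketch, not a complete program):

```csharp
// Only change versus the question's sample: EnsureOrdered = false lets each
// message leave a block as soon as it completes, instead of waiting in the
// reordering buffer behind earlier (still semaphore-blocked) messages.
var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 5,
    EnsureOrdered = false
};

var filter2 = new TransformBlock<PartitionInfo, PartitionInfo>(ReadPartitionAsync, blockOptions);
var filter3 = new TransformBlock<PartitionInfo, PartitionInfo>(MapPartitionAsync, blockOptions);
var filter4 = new TransformBlock<PartitionInfo, PartitionInfo>(ValidatePartitionAsync, blockOptions);
```

Keep in mind that with `EnsureOrdered = false` the partitions may reach the final `ActionBlock` out of order; if downstream processing depends on partition order, you would need to restore it there yourself.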

SeanOB