
I need to process every record from an input file asynchronously, in batches. For example, say my input file has 100 records and my batch size is set to 10: I need to process the records in 10 batches (10 records per batch), and all these batches should be processed asynchronously. The batch size of 10 is not fixed and might vary.

I have a read function which reads each record from the file. Once 10 records are read, I need to call an async method which processes these records using a task. Meanwhile, the main thread should continue reading the next set of records to fill the next batch (the next 10 records), and once they are read, call the same async method to process them using another task, continuing until all records are read.

Right now, I am able to read the records, fill each batch, and then process the batches one after the other, but I want to do it asynchronously.

I am providing a snippet of my code below:

public async Task ProcessRecordAsync(InputFile inputFile)
{
    int recordCount = 0;
    List<Task> TaskList = new List<Task>();

    while (condition to check if records present)
    {
        Object getRecordVal = ReadInputRecord();
        if (++recordCount >= 10)
        {
            var LastTask = new Task(async () => await ProcessRecordAsync());
            LastTask.Start();
            TaskList.Add(LastTask);
        }
    }

    Task.WaitAll(TaskList.ToArray());
}


ProcessRecordAsync() --> This is the function which processes the input record

I think I am going wrong somewhere in how I am calling the task. Once a batch is filled, I want to call the ProcessRecordAsync function using a task while the main thread continues to read records and fill the next batch. With this code, I am getting an exception.

I am getting below error:

    System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
       at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
       at System.Collections.Generic.List`1.Enumerator.MoveNextRare()

Is this the right way to handle multiple tasks?

  • The error message indicates that you were iterating over a list but modified that list during the iteration. It's not apparent from the code you posted where that might happen. – Eric J. Sep 12 '19 at 03:35
  • You might consider using the .Batch extension method of the MoreLinq package for the batching aspect. https://www.nuget.org/packages/MoreLinq – Eric J. Sep 12 '19 at 03:36
  • You might want to consider _TPL DataFlow_. It allows for division of work and pipelining but in a much easier way than with `async/await` and `Task` alone –  Sep 12 '19 at 03:43
  • Thank you so much for your help. I think the issue posted in the question has been resolved. I think I was mistakenly altering the list while iterating and that caused the issue. I am now facing a new issue and working on resolving it. I will open up a new post and give a detailed explanation about it if I am not able to resolve it. – Sriram Chandramouli Sep 12 '19 at 20:22

2 Answers


I haven't used MoreLinq or the TPL Dataflow library that the others suggested...

If you wanted to stick with async/await something like this would get the job done (though there are likely some optimizations to be found):

    async Task Main()
    {
        await BatchProcessAsync(GetValues(), ProcessElementAsync);
    }

    public async Task BatchProcessAsync<T>(
        IEnumerable<T> elements,
        Func<T, Task> operationAsync,
        int batchSize = 10)
    {
        using (var en = elements.GetEnumerator())
        {
            var ops = new List<Task>();

            while (en.MoveNext())
            {
                ops.Add(operationAsync(en.Current));
                if (ops.Count == batchSize)
                {
                    await Task.WhenAll(ops);
                    ops.Clear();
                }
            }

            // process any remaining operations
            if (ops.Any()) { await Task.WhenAll(ops); }
        }
    }    

    public async Task ProcessElementAsync(string element)
    {
        Print($"Processing element: {element}...");
        await Task.Delay(300);
        Print($"Completed element: {element}.");

        void Print(string output)
            => Console.WriteLine($"[{DateTime.Now:s}] {output}");
    }

    public IEnumerable<string> GetValues(int maxValues = 100)
        => Enumerable.Range(1, maxValues).Select(i => $"Element #{i}");

EDIT: After posting, I re-read the original question and realized that this implementation assumes you have already read all the records, so it misses the part about having the main thread continue reading records from the input file. However, it should not be difficult to apply the same batching technique to reading the records from the input file in batches and then feeding the records in to be batch processed.
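For instance, since `IEnumerable<T>` is evaluated lazily, records can be streamed straight from the file into the batch processor, something like this (assuming a hypothetical `input.txt` file, together with the `BatchProcessAsync` and `ProcessElementAsync` methods above):

```csharp
// File.ReadLines enumerates the file lazily, one line at a time,
// so each batch of records is read from disk only as it is needed.
// "input.txt" is a hypothetical file name.
await BatchProcessAsync(File.ReadLines("input.txt"), ProcessElementAsync, batchSize: 10);
```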

coding.monkey
  • Good answer, although your batching mechanism is quite inefficient. The source `IEnumerable` is enumerated using LINQ again and again. Imagine that it is the result of the [`File.ReadLines`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines) method, and you'll see why this is a problem... – Theodor Zoulias Sep 12 '19 at 05:57
  • @TheodorZoulias - Very good point... I've edited the example with a new version that I believe will resolve your concern. – coding.monkey Sep 12 '19 at 06:09
  • This is the right approach. Just call `Dispose` to the enumerator at the end to be absolutely perfect. :-) – Theodor Zoulias Sep 12 '19 at 06:14

What you are trying to implement is the producer-consumer pattern. The most familiar way to implement this pattern is with the BlockingCollection class. There is not much to learn beyond the Add, CompleteAdding, and GetConsumingEnumerable methods. Although easy to learn, it is not the most powerful tool: it is not very efficient with small workloads, and being blocking by nature it isn't great for scalability either. There is also no native support for batching; you must do everything yourself by managing lists or arrays. I have tried to make a chunky implementation, with mediocre success.
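A minimal sketch of manual batching on top of BlockingCollection might look like this (the method and variable names here are illustrative, not from the question; the producer simulates reading 25 records):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class BlockingCollectionBatchDemo
{
    static async Task Main()
    {
        var queue = new BlockingCollection<string>(boundedCapacity: 100);

        // Producer: reads records and adds them to the collection.
        var producer = Task.Run(() =>
        {
            for (int i = 1; i <= 25; i++) queue.Add($"record {i}");
            queue.CompleteAdding(); // signal that no more items will arrive
        });

        // Consumer: collects items into batches of 10 and processes each batch.
        var consumer = Task.Run(async () =>
        {
            var batch = new List<string>(10);
            foreach (var item in queue.GetConsumingEnumerable())
            {
                batch.Add(item);
                if (batch.Count == 10)
                {
                    await ProcessBatchAsync(batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0) await ProcessBatchAsync(batch); // leftover items
        });

        await Task.WhenAll(producer, consumer);
    }

    static async Task ProcessBatchAsync(IReadOnlyList<string> batch)
    {
        Console.WriteLine($"Processing batch of {batch.Count}");
        await Task.Delay(100); // simulated work
    }
}
```

Notice how much bookkeeping (the batch list, the leftover check, the completion signal) has to be written by hand; this is exactly what BatchBlock does for you below.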

Recently I invested the time to learn the TPL Dataflow library, and I can clearly say that it is the right tool for the job. You need to define two blocks, one to do the batching (BatchBlock), and one to do the processing (ActionBlock). Then form a small pipeline by linking the blocks together, feed the data to the first block, and have them processed automatically in batches by the second block. Finally call Complete on the first block, and await the Completion of the second block. The library creates and manages all the tasks needed to do the job. The performance is optimal; it's just that the API is unfamiliar, not particularly intuitive, and a bit verbose when you have to configure each block by providing options to its constructor.

var batchBlock = new BatchBlock<string>(batchSize: 10);
var actionBlock = new ActionBlock<string[]>(batch =>
{
    // Do something with the batch. For example:
    Console.WriteLine("Processing batch");
    foreach (var line in batch)
    {
        Console.WriteLine(line);
    }
});
batchBlock.LinkTo(actionBlock,
    new DataflowLinkOptions() { PropagateCompletion = true });
foreach (var line in File.ReadLines(@".\..\..\_Data.txt"))
{
    await batchBlock.SendAsync(line).ConfigureAwait(false);
}
batchBlock.Complete();
await actionBlock.Completion.ConfigureAwait(false);
Theodor Zoulias
  • Note that the batch won't process until it reaches the limit, so you may be stuck with a few stragglers. You can mix this with Rx, and I also have a modified batch block that can fire on batch size or time threshold if there is work available. However you could just use a timer as well. – TheGeneral Sep 12 '19 at 07:47
  • @TheGeneral I am not sure about what case you have in mind. A file with 25 lines will be processed in 3 batches, comprised of 10, 10 and 5 lines respectively. The last batch will get the last lines remaining, typically less than 10. At the end no line will be left unprocessed. – Theodor Zoulias Sep 12 '19 at 09:48
  • Ahh yeah you are right, I thought this was perpetual. I should have looked at the code; the Complete will do the trick – TheGeneral Sep 12 '19 at 10:19