I have an IAsyncEnumerable<string> stream that contains data downloaded from the web, and I want to asynchronously save each piece of data in a SQL database. So I used the ForEachAwaitAsync extension method from the System.Linq.Async library. My problem is that downloading and saving each piece of data happens sequentially, while I would prefer it to happen concurrently.

To clarify, I don't want to download more than one piece of data at a time, nor do I want to save more than one piece of data at a time. What I want is that while I am saving a piece of data in the database, the next piece of data is concurrently downloaded from the web.

Below is a minimal (contrived) example of my current solution. Five items are downloaded and then saved in the database. Downloading each item takes 1 second, and saving it takes another 1 second:

async IAsyncEnumerable<string> GetDataFromWeb()
{
    foreach (var item in Enumerable.Range(1, 5))
    {
        Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Downloading #{item}");
        await Task.Delay(1000); // Simulate an I/O-bound operation
        yield return item.ToString();
    }
}

var stopwatch = Stopwatch.StartNew();
await GetDataFromWeb().ForEachAwaitAsync(async item =>
{
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Saving #{item}");
    await Task.Delay(1000); // Simulate an I/O-bound operation
});
Console.WriteLine($"Duration: {stopwatch.ElapsedMilliseconds:#,0} msec");

The code works, but not in the way I want. The total duration is ~10 seconds, instead of the desired ~6 seconds.

Actual undesirable output:

04:55:50.526 > Downloading #1
04:55:51.595 > Saving #1
04:55:52.598 > Downloading #2
04:55:53.609 > Saving #2
04:55:54.615 > Downloading #3
04:55:55.616 > Saving #3
04:55:56.617 > Downloading #4
04:55:57.619 > Saving #4
04:55:58.621 > Downloading #5
04:55:59.622 > Saving #5
Duration: 10,115 msec

Hypothetical desirable output:

04:55:50.000 > Downloading #1
04:55:51.000 > Saving #1
04:55:51.000 > Downloading #2
04:55:52.000 > Saving #2
04:55:52.000 > Downloading #3
04:55:53.000 > Saving #3
04:55:53.000 > Downloading #4
04:55:54.000 > Saving #4
04:55:54.000 > Downloading #5
04:55:55.000 > Saving #5
Duration: 6,000 msec

I am thinking about implementing a custom extension method named ForEachConcurrentAsync, with the same signature as the aforementioned ForEachAwaitAsync method, but with behavior that allows enumerating and acting on the items to occur concurrently. Below is a stub of this method:

/// <summary>
/// Invokes and awaits an asynchronous action on each element in the source sequence.
/// Each action is awaited concurrently with fetching the sequence's next element.
/// </summary>
public static Task ForEachConcurrentAsync<T>(
    this IAsyncEnumerable<T> source,
    Func<T, Task> action,
    CancellationToken cancellationToken = default)
{
    // What to do?
}

How could this functionality be implemented?

Additional requirements:

  1. Leaking running tasks in case of cancellation or failure is not acceptable. All started tasks should have completed by the time the method completes.
  2. In the extreme case that both the enumeration and an action fail, only one of the two exceptions should be propagated, and either one is OK.
  3. The method should be genuinely asynchronous, and should not block the current thread (unless the action parameter contains blocking code, but preventing this is the responsibility of the caller).

Clarifications:

  1. In case saving the data takes longer than downloading it from the web, the method should not keep downloading more items in advance. At most one piece of data should be downloaded in advance, while the previous one is being saved.

  2. The IAsyncEnumerable<string> with the web data is the starting point of this problem. I don't want to change the generator method of the IAsyncEnumerable<string>. I want to act on its elements (by saving them into the database), while the enumerable is enumerated.

Theodor Zoulias
  • You'd need to collect the Tasks from the `action` calls and do a `Task.WhenAll` at the end. – juharr Feb 14 '21 at 04:08
  • @juharr ideally I would like to avoid keeping track of all tasks during the enumeration. An `IAsyncEnumerable` could theoretically emit infinite elements. – Theodor Zoulias Feb 14 '21 at 04:12
  • If you truly get an infinite collection of items then the code will never finish, but to handle that you'd just need a buffer and once it hits a limit you could do `WhenAny` and then remove completed tasks. Because you could have all the downloads finish before the first save so there's no way to continue iterating the collection without somehow keeping track of the tasks unless you want to fire and forget them. – juharr Feb 14 '21 at 04:16
  • @juharr yeap, maintaining a limited buffer of tasks is certainly a possibility. I don't know though how this will help me to achieve the behavior I want. Regarding *"have all the downloads finish before the first save"*, this is not the functionality I want. Only one piece of data should be downloaded at maximum, while the previous is saved. – Theodor Zoulias Feb 14 '21 at 04:23
  • I'm theoretically saying that the time to get an item from the collection might be much less than the time for the action in which case I assume you want to get the items in sequence and not in parallel, but you want the actions to not delay getting the next item. If you're saying you want to wait to download the 2nd item until after the first one is saved then your expectation of how long it will take is completely wrong. – juharr Feb 14 '21 at 04:28
  • @juharr I added a clarification in my question, because this point was ill defined indeed. – Theodor Zoulias Feb 14 '21 at 04:30
  • Or are you saying you want the 1st save and 2nd download in parallel, but the 2nd save shouldn't start until the 1st is saved, and of course the 2nd is downloaded? – juharr Feb 14 '21 at 04:31
  • @juharr that's exactly what I want. 1st save and 2nd download in parallel, 2nd save and 3rd download in parallel, 3rd save and 4th download in parallel etc. – Theodor Zoulias Feb 14 '21 at 04:33
  • In that case you just need to keep track of the task from the previous action in the iteration and await that before you do the next action. – juharr Feb 14 '21 at 04:41
  • @juharr yes, I think that this idea is in the right direction for solving this problem! – Theodor Zoulias Feb 14 '21 at 04:49
  • @TheodorZoulias is it possible to use a materialized list/array instead of the sequence `Enumerable.Range(1, 5)` in GetDataFromWeb? – stefan Feb 14 '21 at 14:11
  • @stefan in reality I have a finite number of data to download from the web, but ideally the `ForEachConcurrentAsync` should be able to handle an `IAsyncEnumerable` having infinite number of items, without consuming an infinite amount of RAM. So changing the `Enumerable.Range` to a `List` in order to solve the problem would probably make the solution to have undesirable properties. But I could still accept it if no better solution was proposed. – Theodor Zoulias Feb 14 '21 at 17:20

3 Answers

It sounds like you just need to keep track of the previous action's Task and await it before the next action Task.

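public static async Task ForEachConcurrentAsync<T>(
    this IAsyncEnumerable<T> source,
    Func<T, Task> action,
    CancellationToken cancellationToken = default) // Not used yet; see the note below
{
    Task previous = null; // The action started for the previous item, if any
    try
    {
        await source.ForEachAwaitAsync(async item =>
        {
            // Wait for the previous action before starting the next one,
            // so that at most one action runs at a time.
            if (previous != null)
            {
                await previous;
            }

            // Start the action without awaiting it, so that the next item
            // is fetched while this action is still running.
            previous = action(item);
        });
    }
    finally
    {
        // Observe the last started action, even if the enumeration failed.
        if (previous != null)
        {
            await previous;
        }
    }
}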

All that's left is to sprinkle in the cancellation code.
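
As a rough illustration of what that cancellation handling might look like, here is a minimal sketch. It is not the code from this answer: it swaps ForEachAwaitAsync for a plain await foreach with WithCancellation, and the class name AsyncEnumerableExtensions and the Produce generator are hypothetical names used only for the demo. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Demo: three items, 100 ms to produce each and 100 ms to process each.
var processed = new List<int>();
await Produce(3).ForEachConcurrentAsync(async item =>
{
    await Task.Delay(100); // Simulate saving the item
    processed.Add(item);
});
Console.WriteLine(string.Join(", ", processed)); // prints "1, 2, 3"

// Hypothetical generator that honors the token passed by WithCancellation.
static async IAsyncEnumerable<int> Produce(int count,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    for (int i = 1; i <= count; i++)
    {
        await Task.Delay(100, cancellationToken); // Simulate downloading
        yield return i;
    }
}

public static class AsyncEnumerableExtensions
{
    public static async Task ForEachConcurrentAsync<T>(
        this IAsyncEnumerable<T> source,
        Func<T, Task> action,
        CancellationToken cancellationToken = default)
    {
        Task previous = Task.CompletedTask;
        try
        {
            // The token is forwarded to the enumerator through WithCancellation.
            await foreach (var item in source
                .WithCancellation(cancellationToken).ConfigureAwait(false))
            {
                // At most one action runs at a time.
                await previous.ConfigureAwait(false);
                // Do not start a new action after cancellation was requested.
                cancellationToken.ThrowIfCancellationRequested();
                previous = action(item);
            }
        }
        finally
        {
            // Never leave the last started action unobserved.
            await previous.ConfigureAwait(false);
        }
    }
}
```

In this sketch the token only stops the enumeration and prevents new actions from starting. An action that is already running is still awaited to completion, and the action delegate does not receive the token, because the signature in the question has no place for it.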

juharr
  • Thanks juharr for the answer. It covers the basic functionality of the problem very well! I cannot accept it though, because it doesn't satisfy the first additional requirement of the question. In case the `source IAsyncEnumerable` fails, a running task could be left behind, running unobserved in a fire-and-forget fashion. – Theodor Zoulias Feb 14 '21 at 05:04
  • I've added a try finally to handle awaiting the previous task in the case of an error. – juharr Feb 14 '21 at 14:22
  • Yeap, now it works perfectly according to the requirements! I would prefer it to not have a dependency on the `System.Linq.Async` library, and also for the `CancellationToken` functionality to be included, but since the requirements have been satisfied I am accepting the answer. – Theodor Zoulias Feb 14 '21 at 17:11

Here is my solution.
I had to change the sequence to an array in order to access the next element.
I'm not sure whether populating an array fits your requirements.

The idea is to start downloading the next item before returning the current.

    private static async Task Main(string[] args)
    {
        var stopwatch = Stopwatch.StartNew();
        await foreach (var item in GetDataFromWebAsync())
        {
            Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Saving #{item}");
            await Task.Delay(1000); // Simulate an I/O-bound operation

        }

        Console.WriteLine($"Duration: {stopwatch.ElapsedMilliseconds:#,0} msec");
    }

    private static async IAsyncEnumerable<string> GetDataFromWebAsync()
    {
        var items = Enumerable
            .Range(1, 5)
            .Select(x => x.ToString())
            .ToArray();

        Task<string> next = null;

        for (var i = 0; i < items.Length; i++)
        {
            var current = next is null 
                ? await DownloadItemAsync(items[i]) 
                : await next;

            var nextIndex = i + 1;
            next = StartNextDownloadAsync(items, nextIndex);
            
            yield return current;
        }
    }

    private static async Task<string> StartNextDownloadAsync(IReadOnlyList<string> items, int nextIndex)
    {
        return nextIndex < items.Count
            ? await DownloadItemAsync(items[nextIndex])
            : null;
    }

    private static async Task<string> DownloadItemAsync(string item)
    {
        Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} > Downloading #{item}");
        await Task.Delay(1000);
        return item;
    }

Console Output:

15:57:26.226 > Downloading #1
15:57:27.301 > Downloading #2
15:57:27.302 > Saving #1
15:57:28.306 > Downloading #3
15:57:28.307 > Saving #2
15:57:29.312 > Downloading #4
15:57:29.340 > Saving #3
15:57:30.344 > Downloading #5
15:57:30.347 > Saving #4
15:57:31.359 > Saving #5
Duration: 6 174 msec
stefan
  • 1
    Thanks Stefan for the answer. It seems to work pretty well. What I don't like to this solution is that the logic is intercepted inside the iterator method of the `IAsyncEnumerable`. I don't really want to change this method. My ultimate goal is to have a generic `ForEachConcurrentAsync` method, that I can use to solve all kinds of problems that have an `IAsyncEnumerable` as a starting point. I've edited the question and added a clarification about that. I am upvoting your answer anyway, because it seems to solve this particular problem. – Theodor Zoulias Feb 14 '21 at 17:57

Here is a relatively simple implementation that does not depend on the System.Linq.Async package:

/// <summary>
/// Invokes and awaits an asynchronous action on each element in the source sequence.
/// Each action is awaited concurrently with fetching the sequence's next element.
/// </summary>
public static async Task ForEachConcurrentAsync<T>(
    this IAsyncEnumerable<T> source,
    Func<T, Task> action,
    CancellationToken cancellationToken = default)
{
    var enumerator = source.GetAsyncEnumerator(cancellationToken);
    await using (enumerator.ConfigureAwait(false))
    {
        if (!await enumerator.MoveNextAsync().ConfigureAwait(false)) return;
        while (true)
        {
            Task task = action(enumerator.Current);
            bool moved;
            try { moved = await enumerator.MoveNextAsync().ConfigureAwait(false); }
            finally { await task.ConfigureAwait(false); }
            if (!moved) break;
        }
    }
}

Instead of awaiting both concurrent tasks with a Task.WhenAll, a try/finally block is used for simplicity. The downside is that if both concurrent operations fail, the exception thrown by MoveNextAsync is replaced by the exception of the action, and so it is not propagated.
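
For comparison, the Task.WhenAll alternative mentioned above could look roughly like the sketch below. It is an illustration rather than a tested replacement: it assumes that MoveNextAsync reports failures through the returned ValueTask instead of throwing synchronously, and the class name ConcurrentEnumerationExtensions and the FailingSource generator are hypothetical names used only for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Demo: the second MoveNextAsync and the first action both fail.
try
{
    await FailingSource().ForEachConcurrentAsync(async item =>
    {
        await Task.Delay(50);
        throw new InvalidOperationException("Action failed");
    });
}
catch (Exception ex)
{
    // Exactly one of the two exceptions surfaces here.
    Console.WriteLine($"{ex.GetType().Name}: {ex.Message}");
}

// Hypothetical source that yields one item and then fails.
static async IAsyncEnumerable<int> FailingSource()
{
    yield return 1;
    await Task.Delay(50);
    throw new TimeoutException("Download failed");
}

public static class ConcurrentEnumerationExtensions
{
    public static async Task ForEachConcurrentAsync<T>(
        this IAsyncEnumerable<T> source,
        Func<T, Task> action,
        CancellationToken cancellationToken = default)
    {
        var enumerator = source.GetAsyncEnumerator(cancellationToken);
        await using (enumerator.ConfigureAwait(false))
        {
            if (!await enumerator.MoveNextAsync().ConfigureAwait(false)) return;
            while (true)
            {
                // Start the action and the fetch of the next item together.
                Task actionTask = action(enumerator.Current);
                Task<bool> moveTask = enumerator.MoveNextAsync().AsTask();
                // WhenAll completes only after both tasks have completed,
                // so neither of them is left running when an error is thrown.
                await Task.WhenAll(moveTask, actionTask).ConfigureAwait(false);
                // moveTask has completed successfully at this point.
                if (!await moveTask.ConfigureAwait(false)) break;
            }
        }
    }
}
```

When both operations fail, the task returned by Task.WhenAll holds both exceptions in its AggregateException, and awaiting it rethrows only one of them, which is in line with the second requirement of the question. The price is an extra Task allocation per item, because of the AsTask call and the WhenAll task.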

Theodor Zoulias