
I have a scenario in which I am reading some data from a database. This data is returned in the form of an IAsyncEnumerable<MyData>. After reading the data I want to send it to a consumer, which is asynchronous. Right now my code looks something like this:

// C#
IAsyncEnumerable<MyData> enumerable = this.dataSource.Read(query);

await foreach (var data in enumerable) 
{
    await this.consumer.Write(data);
}

My problem with this is that while I am enumerating the database, I am holding a lock on the data. I don't want to hold this lock for longer than I need to.

In the event that the consumer is consuming data slower than the producer is producing it, is there any way I can eagerly read from the data source without just calling ToList or ToListAsync? I want to avoid reading all the data into memory at once, which would cause the opposite problem if the producer turned out to be slower than the consumer. It is OK if the lock on the database is not held for the shortest possible time; I want a configurable tradeoff between how much data is in memory at once and how long we keep the enumeration running.

My thought is that there would be some way to use a queue or channel-like data structure to act as a buffer between the producer and the consumer.

In Golang I would do something like this:

// go
queue := make(chan MyData, BUFFER_SIZE)
go dataSource.Read(query, queue)

// Read sends data on the channel, closes it when done

for data := range queue {
    consumer.Write(data)
}

Is there any way to get similar behavior in C#?

  • "any way I can eagerly read from the datasource without just calling ToList" well, it sounds like you need to pull all the results into memory, which can be done with...... – gunr2171 Oct 25 '22 at 17:47
  • Let me clarify: I don't want to read the whole thing into memory at once; this would have the opposite problem if the producer were slower than the consumer. – Rafael Oct 25 '22 at 17:51
  • I don't know Go, so I'm asking this: what happens in your Go code if the producer is slower? Wouldn't the queue be quickly emptied and the loop end before the producer has produced everything? The consumer would end up not consuming everything. – Sweeper Oct 25 '22 at 18:03
  • You can use BlockingCollection with ConcurrentQueue and a max size set (the analog of the buffer size you have in Go). The producer will put items there, blocking if the max size is reached, and the consumer will, well, consume, blocking if the queue is empty (waiting for the next item). Docs: https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.blockingcollection-1?view=net-6.0 – Evk Oct 25 '22 at 18:09
  • @Sweeper, the loop will block and wait for a new item until the producer closes the queue. – Rafael Oct 25 '22 at 18:12
  • You will just need to offload putting items into that blocking collection so that it runs in parallel and not sequentially like now (like you do with `go` in the Go example). – Evk Oct 25 '22 at 18:33
  • Are you in control of the `this.dataSource.Read` implementation, or is it implemented by Microsoft or a third party? – Theodor Zoulias Oct 25 '22 at 18:44
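
A minimal sketch of the buffering pattern suggested in the comments above (a bounded BlockingCollection<T> as the buffer and a Task as the producer, mirroring the Go example; the names reuse the question's scenario and the capacity of 100 is an arbitrary value):

// Requires: using System.Collections.Concurrent;
var queue = new BlockingCollection<MyData>(boundedCapacity: 100);

Task producer = Task.Run(async () =>
{
    try
    {
        await foreach (var data in dataSource.Read(query))
        {
            queue.Add(data); // blocks while the buffer is full
        }
    }
    finally
    {
        queue.CompleteAdding(); // lets the consuming loop below end
    }
});

foreach (var data in queue.GetConsumingEnumerable()) // blocks while empty
{
    await consumer.Write(data);
}

await producer; // propagate any producer exception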

2 Answers


Here is a more robust implementation of the ConsumeBuffered extension method from Rafael's answer. This one uses a Channel<T> as the buffer instead of a BlockingCollection<T>. The advantage is that enumerating the two sequences, the source and the buffered one, does not block a thread each. Care has been taken to complete the enumeration of the source sequence in case the enumeration of the buffered sequence is abandoned prematurely by the consumer downstream.

// Requires: using System.Threading.Channels;
//           using System.Runtime.CompilerServices;
public static async IAsyncEnumerable<T> ConsumeBuffered<T>(
    this IAsyncEnumerable<T> source, int capacity,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    ArgumentNullException.ThrowIfNull(source);
    Channel<T> channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
    {
        SingleWriter = true,
        SingleReader = true,
    });
    using CancellationTokenSource completionCts = new();

    Task producer = Task.Run(async () =>
    {
        try
        {
            await foreach (T item in source.WithCancellation(completionCts.Token)
                .ConfigureAwait(false))
            {
                await channel.Writer.WriteAsync(item).ConfigureAwait(false);
            }
        }
        catch (ChannelClosedException) { } // Ignore
        finally { channel.Writer.TryComplete(); }
    });

    try
    {
        await foreach (T item in channel.Reader.ReadAllAsync(cancellationToken)
            .ConfigureAwait(false))
        {
            yield return item;
            cancellationToken.ThrowIfCancellationRequested();
        }
        await producer.ConfigureAwait(false); // Propagate possible source error
    }
    finally
    {
        // Prevent fire-and-forget in case the enumeration is abandoned
        if (!producer.IsCompleted)
        {
            completionCts.Cancel();
            channel.Writer.TryComplete();
            await Task.WhenAny(producer).ConfigureAwait(false);
        }
    }
}

Setting the SingleWriter and SingleReader options of the bounded channel is a bit academic, and they could be omitted. Currently (.NET 6) there is only one bounded Channel<T> implementation in the System.Threading.Channels library, regardless of the supplied options. This implementation is based on a Deque<T> (an internal .NET type similar to a Queue<T>) synchronized with a lock.

The channel is enumerated inside a try/finally block, because C# iterators execute the finally blocks as part of the Dispose/DisposeAsync method of the autogenerated IEnumerator<T>/IAsyncEnumerator<T>, when the enumeration is abandoned.
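
A minimal standalone sketch of this mechanism (the names are illustrative, not part of the answer's code): breaking out of the loop below disposes the iterator, which runs its finally block.

static async IAsyncEnumerable<int> Numbers()
{
    try
    {
        for (int i = 0; ; i++)
        {
            yield return i;
            await Task.Yield();
        }
    }
    finally
    {
        Console.WriteLine("finally ran"); // executes via DisposeAsync
    }
}

await foreach (int n in Numbers())
{
    if (n == 2) break; // abandons the enumeration; "finally ran" is printed
}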

Note: In case the external CancellationToken is canceled, the cancellation is propagated as an OperationCanceledException, and all the buffered items are lost. In a producer-consumer scenario with multiple producers and consumers this might be a problem, so it is advisable to use the CancellationToken only for tearing down the entire processing pipeline, not for parts of it.
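
For example, here is a hypothetical usage in the question's scenario (dataSource, consumer and query come from the question; the capacity of 50 is an arbitrary value):

using CancellationTokenSource cts = new();
IAsyncEnumerable<MyData> enumerable = this.dataSource.Read(query);

await foreach (var data in enumerable.ConsumeBuffered(capacity: 50)
    .WithCancellation(cts.Token))
{
    await this.consumer.Write(data);
}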

  • I don't think it's possible to yield within a try-catch block... is that limited to just certain versions of C#? – Rafael Oct 27 '22 at 17:10
  • @Rafael if I remember correctly, you can't `yield` from a `try` block that has a `catch` block, but `try`+`finally` is OK (without a `catch`). – Theodor Zoulias Oct 27 '22 at 17:27
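
For reference, the rule mentioned in the last comment is compiler error CS1626 ("Cannot yield a value in the body of a try block with a catch clause"). A minimal illustration (the first method does not compile):

static IEnumerable<int> Broken()
{
    try
    {
        yield return 1; // error CS1626: the try block has a catch clause
    }
    catch (Exception) { }
}

static IEnumerable<int> Fine()
{
    try
    {
        yield return 1; // OK: the try block has only a finally
    }
    finally { }
}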

Thank you to @Evk for pointing me towards the BlockingCollection<T>; this is the solution I came up with. It allows me to eagerly produce from an IAsyncEnumerable even if the consumer can't keep up. It may also be possible to come up with a similar solution using System.Threading.Channels to mimic the Go example.

// Requires: using System.Collections.Concurrent;
public static async IAsyncEnumerable<T> ConsumeBuffered<T>(this IAsyncEnumerable<T> enumerable, int? maxBuffer = null)
    where T: class
{
    using (BlockingCollection<T> queue = maxBuffer == null ? new BlockingCollection<T>() : new BlockingCollection<T>(maxBuffer.Value))
    {
        Task producer = Task.Run(
            async () =>
            {
                try
                {
                    await foreach (T item in enumerable.ConfigureAwait(false))
                    {
                        queue.Add(item);
                    }
                }
                finally
                {
                    // Complete even if the source throws; otherwise the
                    // consuming loop below would block forever in Take()
                    queue.CompleteAdding();
                }
            });

        while (true)
        {
            T next;
            try
            {
                next = queue.Take();
            }
            catch (InvalidOperationException)
            {
                // Take() throws once CompleteAdding has been called
                // and the collection is empty
                break;
            }

            yield return next;
        }

        // Await the producer to propagate any exception thrown by the
        // source; the task is already complete if we exited the loop.
        await producer.ConfigureAwait(false);
    }
}

This probably needs some polishing and testing for edge cases, but it seems to work in unit tests.

  • How does this help at solving your problem? If you set the `maxBuffer`, the `queue.Add` will block when the `BlockingCollection` is full, so the `await foreach` will get blocked, keeping the database locked, which is what you want to avoid. If you don't set the `maxBuffer`, the `await foreach` will run at max speed and will store all the data inside the `BlockingCollection`, which is a less efficient storage than a `List`, which is also something that you want to avoid. One way or another you end up in an undesirable situation. – Theodor Zoulias Oct 25 '22 at 21:21
  • @TheodorZoulias for my scenario I expect the consumer to be only slightly slower than the producer. You are correct that in the worst case the unbounded queue will just fill up with the entire dataset, but as long as the consumer manages to consume a few elements while the producer is still producing, the queue never holds the whole dataset at any given point. You make a very good point though, I will consider this further. – Rafael Oct 25 '22 at 21:27
  • Best case scenario for your solution is that the database will get locked for half the time, and then half the data will get buffered. So it's a compromise between locking the database until all the data is consumed and buffering all the data in memory. The question gives no clue that such a compromise is satisfactory. To make this a valid answer to your question, you should edit the question and describe the kind of compromise you are looking for. – Theodor Zoulias Oct 25 '22 at 21:47
  • Appreciate the suggestion, I will edit the question to clarify. Thanks! – Rafael Oct 25 '22 at 21:57
  • Instead of `while (true) next = queue.Take();` it's simpler to do `foreach (var next in queue.GetConsumingEnumerable())`. Also this solution will result in a memory leak in case the enumeration of the resulting `IAsyncEnumerable` is abandoned prematurely (after either a `break` or an exception), because the `Task producer` will never complete, and the database will remain locked until the whole process terminates. – Theodor Zoulias Oct 25 '22 at 22:20
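
Following the last comment, here is a sketch of the GetConsumingEnumerable simplification, which would replace the while/Take loop in the answer above (queue and producer are the variables from the answer's code):

// GetConsumingEnumerable completes when CompleteAdding has been called
// and the collection is empty, so no exception handling is needed.
foreach (T next in queue.GetConsumingEnumerable())
{
    yield return next;
}

await producer.ConfigureAwait(false); // propagate any producer exception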