C# Abortable Asynchronous Fifo Queue - leaking massive amounts of memory

Question

I need to process data from a producer in FIFO fashion with the ability to abort processing if the same producer produces a new bit of data.

So I implemented an abortable FIFO queue based on Stephen Cleary's AsyncCollection (called AsyncCollectionAbortableFifoQueuein my sample) and one on TPL's BufferBlock (BufferBlockAbortableAsyncFifoQueue in my sample). Here's the implementation based on AsyncCollection

public class AsyncCollectionAbortableFifoQueue<T> : IExecutableAsyncFifoQueue<T>
{
    private AsyncCollection<AsyncWorkItem<T>> taskQueue = new AsyncCollection<AsyncWorkItem<T>>();
    private readonly CancellationToken stopProcessingToken;

    public AsyncCollectionAbortableFifoQueue(CancellationToken cancelToken)
    {
        stopProcessingToken = cancelToken;
        _ = processQueuedItems();
    }

    public Task<T> EnqueueTask(Func<Task<T>> action, CancellationToken? cancelToken)
    {
        var tcs = new TaskCompletionSource<T>();
        var item = new AsyncWorkItem<T>(tcs, action, cancelToken);
        taskQueue.Add(item);
        return tcs.Task;
    }

    protected virtual async Task processQueuedItems()
    {
        while (!stopProcessingToken.IsCancellationRequested)
        {
            try
            {
                var item = await taskQueue.TakeAsync(stopProcessingToken).ConfigureAwait(false);
                if (item.CancelToken.HasValue && item.CancelToken.Value.IsCancellationRequested)
                    item.TaskSource.SetCanceled();
                else
                {
                    try
                    {
                        T result = await item.Action().ConfigureAwait(false);
                        item.TaskSource.SetResult(result);   // Indicate completion
                    }
                    catch (Exception ex)
                    {
                        if (ex is OperationCanceledException && ((OperationCanceledException)ex).CancellationToken == item.CancelToken)
                            item.TaskSource.SetCanceled();
                        item.TaskSource.SetException(ex);
                    }
                }
            }
            catch (Exception) { }
        }
    }
}

public interface IExecutableAsyncFifoQueue<T>
{
    Task<T> EnqueueTask(Func<Task<T>> action, CancellationToken? cancelToken);
}

processQueuedItems is the task that dequeues AsyncWorkItem's from the queue, and executes them unless cancellation has been requested.

The asynchronous action to execute gets wrapped into an AsyncWorkItem which looks like this

internal class AsyncWorkItem<T>
{
    public readonly TaskCompletionSource<T> TaskSource;
    public readonly Func<Task<T>> Action;
    public readonly CancellationToken? CancelToken;

    public AsyncWorkItem(TaskCompletionSource<T> taskSource, Func<Task<T>> action, CancellationToken? cancelToken)
    {
        TaskSource = taskSource;
        Action = action;
        CancelToken = cancelToken;
    }
}

Then there's a task looking and dequeueing items for processing and either processing them, or aborting if the CancellationToken has been triggered.

That all works just fine - data gets processed, and if a new piece of data is received, processing of the old is aborted. My problem now stems from these Queues leaking massive amounts of memory if I crank up the usage (producer producing a lot more than the consumer processes). Given it's abortable, the data that is not processed, should be discarded and eventually disappear from memory.

So let's look at how I'm using these queues. I have a 1:1 match of producer and consumer. Every consumer handles data of a single producer. Whenever I get a new data item, and it doesn't match the previous one, I catch the queue for the given producer (User.UserId) or create a new one (the 'executor' in the code snippet). Then I have a ConcurrentDictionary that holds a CancellationTokenSource per producer/consumer combo. If there's a previous CancellationTokenSource, I call Cancel on it and Dispose it 20 seconds later (immediate disposal would cause exceptions in the queue). I then enqueue processing of the new data. The queue returns me a task that I can await so I know when processing of the data is complete, and I then return the result.

Here's that in code

internal class SimpleLeakyConsumer
{
    private ConcurrentDictionary<string, IExecutableAsyncFifoQueue<bool>> groupStateChangeExecutors = new ConcurrentDictionary<string, IExecutableAsyncFifoQueue<bool>>();
    private readonly ConcurrentDictionary<string, CancellationTokenSource> userStateChangeAborters = new ConcurrentDictionary<string, CancellationTokenSource>();
    protected CancellationTokenSource serverShutDownSource;
    private readonly int operationDuration = 1000;

    internal SimpleLeakyConsumer(CancellationTokenSource serverShutDownSource, int operationDuration)
    {
        this.serverShutDownSource = serverShutDownSource;
        this.operationDuration = operationDuration * 1000; // convert from seconds to milliseconds
    }

    internal async Task<bool> ProcessStateChange(string userId)
    {
        var executor = groupStateChangeExecutors.GetOrAdd(userId, new AsyncCollectionAbortableFifoQueue<bool>(serverShutDownSource.Token));
        CancellationTokenSource oldSource = null;
        using (var cancelSource = userStateChangeAborters.AddOrUpdate(userId, new CancellationTokenSource(), (key, existingValue) =>
        {
            oldSource = existingValue;
            return new CancellationTokenSource();
        }))
        {
            if (oldSource != null && !oldSource.IsCancellationRequested)
            {
                oldSource.Cancel();
                _ = delayedDispose(oldSource);
            }
            try
            {
                var executionTask = executor.EnqueueTask(async () => { await Task.Delay(operationDuration, cancelSource.Token).ConfigureAwait(false); return true; }, cancelSource.Token);
                var result = await executionTask.ConfigureAwait(false);
                userStateChangeAborters.TryRemove(userId, out var aborter);
                return result;
            }
            catch (Exception e)
            {
                if (e is TaskCanceledException || e is OperationCanceledException)
                    return true;
                else
                {
                    userStateChangeAborters.TryRemove(userId, out var aborter);
                    return false;
                }
            }
        }
    }

    private async Task delayedDispose(CancellationTokenSource src)
    {
        try
        {
            await Task.Delay(20 * 1000).ConfigureAwait(false);
        }
        finally
        {
            try
            {
                src.Dispose();
            }
            catch (ObjectDisposedException) { }
        }
    }
}

In this sample implementation, all that is being done is wait, then return true.

To test this mechanism, I wrote the following Data producer class:

internal class SimpleProducer
{

    //variables defining the test
    readonly int nbOfusers = 10;
    readonly int minimumDelayBetweenTest = 1; // seconds
    readonly int maximumDelayBetweenTests = 6; // seconds
    readonly int operationDuration = 3; // number of seconds an operation takes in the tester

    private readonly Random rand;
    private List<User> users;
    private readonly SimpleLeakyConsumer consumer;

    protected CancellationTokenSource serverShutDownSource, testAbortSource;
    private CancellationToken internalToken = CancellationToken.None;

    internal SimpleProducer()
    {
        rand = new Random();
        testAbortSource = new CancellationTokenSource();
        serverShutDownSource = new CancellationTokenSource();
        generateTestObjects(nbOfusers, 0, false);
        consumer = new SimpleLeakyConsumer(serverShutDownSource, operationDuration);
    }

    internal void StartTests()
    {
        if (internalToken == CancellationToken.None || internalToken.IsCancellationRequested)
        {
            internalToken = testAbortSource.Token;
            foreach (var user in users)
                _ = setNewUserPresence(internalToken, user);
        }
    }

    internal void StopTests()
    {
        testAbortSource.Cancel();
        try
        {
            testAbortSource.Dispose();
        }
        catch (ObjectDisposedException) { }
        testAbortSource = new CancellationTokenSource();
    }

    internal void Shutdown()
    {
        serverShutDownSource.Cancel();
    }

    private async Task setNewUserPresence(CancellationToken token, User user)
    {
        while (!token.IsCancellationRequested)
        {
            var nextInterval = rand.Next(minimumDelayBetweenTest, maximumDelayBetweenTests);
            try
            {
                await Task.Delay(nextInterval * 1000, testAbortSource.Token).ConfigureAwait(false);
            }
            catch (TaskCanceledException)
            {
                break;
            }
            //now randomly generate a new state and submit it to the tester class
            UserState? status;
            var nbStates = Enum.GetValues(typeof(UserState)).Length;
            if (user.CurrentStatus == null)
            {
                var newInt = rand.Next(nbStates);
                status = (UserState)newInt;
            }
            else
            {
                do
                {
                    var newInt = rand.Next(nbStates);
                    status = (UserState)newInt;
                }
                while (status == user.CurrentStatus);
            }
            _ = sendUserStatus(user, status.Value);
        }
    }

    private async Task sendUserStatus(User user, UserState status)
    {
        await consumer.ProcessStateChange(user.UserId).ConfigureAwait(false);
    }

    private void generateTestObjects(int nbUsers, int nbTeams, bool addAllUsersToTeams = false)
    {
        users = new List<User>();
        for (int i = 0; i < nbUsers; i++)
        {
            var usr = new User
            {
                UserId = $"User_{i}",
                Groups = new List<Team>()
            };
            users.Add(usr);
        }
    }
}

It uses the variables at the beginning of the class to control the test. You can define the number of users (nbOfusers - every user is a producer that produces new data), the minimum (minimumDelayBetweenTest) and maximum (maximumDelayBetweenTests) delay between a user producing the next data and how long it takes the consumer to process the data (operationDuration).

StartTests starts the actual test, and StopTests stops the tests again.

I'm calling these as follows

static void Main(string[] args)
    {
        var tester = new SimpleProducer();
        Console.WriteLine("Test successfully started, type exit to stop");
        string str;
        do
        {
            str = Console.ReadLine();
            if (str == "start")
                tester.StartTests();
            else if (str == "stop")
                tester.StopTests();
        }
        while (str != "exit");
        tester.Shutdown();
    }

So, if I run my tester and type 'start', the Producer class starts producing states that are consumed by Consumer. And memory usage starts to grow and grow and grow. The sample is configured to the extreme, the real-life scenario I'm dealing with is less intensive, but one action of the producer could trigger multiple actions on the consumer side which also have to be executed in the same asynchronous abortable fifo fashion - so worst case, one set of data produced triggers an action for ~10 consumers (that last part I stripped out for brevity).

When I'm having a 100 producers, and each producer produces a new data item every 1-6 seconds (randomly, also the data produces is random). Consuming the data takes 3 seconds.. so there's plenty of cases where there's a new set of data before the old one has been properly processed.

Looking at two consecutive memory dumps, it's obvious where the memory usage is coming from.. it's all fragments that have to do with the queue. Given that I'm disposing every TaskCancellationSource and not keeping any references to the produced data (and the AsyncWorkItem they're put into), I'm at a loss to explain why this keeps eating up my memory and I'm hoping somebody else can show me the errors of my way. You can also abort testing by typing 'stop'.. you'll see that no longer is memory being eaten, but even if you pause and trigger GC, memory is not being freed either.

The source code of the project in runnable form is on Github. After starting it, you have to type start (plus enter) in the console to tell the producer to start producing data. And you can stop producing data by typing stop (plus enter)

It seems that you are not using the `CancellationToken`s to cancel running tasks. You only use them to avoid starting tasks that have not already been created and started. So you are not using the cancellation mechanism to its full potential, and you could have used instead something more lightweight and not disposable, like `volatile bool` wrappers. Do you use the `CancellationToken`s for having the option of introducing cancelable tasks in the future? — Theodor Zoulias, May 14 '20 at 18:27
Also it seems that what you need is an async queue that can hold at maximum one element at its buffer, and will replace the old buffered item with any new item it receives. If this is indeed the case, you could take a look at [bounded channels](https://learn.microsoft.com/en-us/dotnet/api/system.threading.channels.channel.createbounded), and the option [`BoundedChannelFullMode.DropOldest`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.channels.boundedchannelfullmode). — Theodor Zoulias, May 14 '20 at 18:34
@TheodorZoulias: What makes you think I'm not using the `CancellationToken` for cancelling running tasks? I'm sending the Token received from the `CancellationTokenSource`I'm creating for each task to the consuming method (`processUserStateUpdateAsync`). The `Task.Delay`on line 120 of `LeakyConsumer` gets aborted if a new state makes it down the pipeline. In my real-live usage of this approach, I make network requests using `HttpClient` and that same token is used to abort them (that's the reason why I'm using `CancellationTokens` - ops I need to cancel work on these tokens) — Stephan Steiner, May 15 '20 at 09:45
@TheodorZoulias: I just pushed a new version. It's now using a bounded channel as the queue. It still leaks massive amounts of memory though. the log it writes contains tons of `12:06:09.625 | INFO | Cancelling execution of processUseStateUpdateAsync for user User_0 because there's a new state: OnThePhone` Those are the indications that processing does indeed get aborted (at line 134 to be specific - that throws a `TaskCancelledException`when `Cancel`is called on the `CancellationTokenSource`) — Stephan Steiner, May 15 '20 at 10:20
Running source code from Github repo does not show any memory leaks. Should message `Starting test for {users.Count} users` be displayed on the start? I don't have it in console output. — cassandrad, May 15 '20 at 10:47
@cassandrad: you need to type 'start' in the console to actually tell the producer to start producing data. And then 'stop' to tell the producer to stop producing data again. It then takes a few seconds and then you'll see in the console that things start happening (every producer first sleeps for a random interval - the intervals are configurable in `Producer.cs`, so there's a wait of several seconds, and then you'll see action and memory increase) — Stephan Steiner, May 15 '20 at 11:37
I saw the `AsyncWorkItem` having a property of type `Func>`, and not of type `Func>`, and from this I assumed that the `CancellationToken` is not passed to the task factory. But OK, it could be still passed it in a deeper level. My mistake. — Theodor Zoulias, May 15 '20 at 13:20
Btw your project has too many lines of code, and digging into it in order to find the source of the memory leak is not a laughing matter. This is certainly not something I would be willing to do for free. IMHO this question is currently not suitable for StackOverflow, unless the code can be severely reduced to 100 lines maximum, so that it can be fully included in the question itself. — Theodor Zoulias, May 15 '20 at 14:50
@TheodorZoulias: I've compressed my code significantly so that the entire Fifo Queue implementation and how I'm using it. It's all in the original post. If you strip away the plumbing like `using` statements we're below 100 lines now. — Stephan Steiner, May 18 '20 at 10:06
The code that is embedded in the question doesn't compile. Some methods, types and interfaces are missing. — Theodor Zoulias, May 18 '20 at 11:10
Try again now.. I updated my original post from a working standalone lib (no longer synchronized with the Github project) — Stephan Steiner, May 18 '20 at 16:15
Now it compiles, but how does it start? Can you include the `Main` method? — Theodor Zoulias, May 18 '20 at 23:44
@TheodorZoulias: I've added the data producer code (not part of the problem - it in itself uses way more than 100 lines of code to do a completely random test with configurable parameters) as well as a main method. — Stephan Steiner, May 21 '20 at 19:56

cassandrad · Accepted Answer · 2020-05-15T20:03:08.063

2

Your code has so many issues making it impossible to find a leak through debugging. But here are several things that already are an issue and should be fixed first:

Looks like getQueue creates a new queue for the same user each time processUseStateUpdateAsync gets called and does not reuse existing queues:

var executor = groupStateChangeExecutors.GetOrAdd(user.UserId, getQueue());

CancellationTokenSource is leaking on each call of the code below, as new value created each time the method AddOrUpdate is called, it should not be passed there that way:

userStateChangeAborters.AddOrUpdate(user.UserId, new CancellationTokenSource(), (key, existingValue

Also code below should use the same cts as you pass as new cts, if dictionary has no value for specific user.UserId:

return new CancellationTokenSource();

Also there is a potential leak of cancelSource variable as it gets bound to a delegate which can live for a time longer than you want, it's better to pass concrete CancellationToken there:

executor.EnqueueTask(() => processUserStateUpdateAsync(user, state, previousState,
                    cancelSource.Token));

By some reason you do not dispose aborter here and in one more place:

userStateChangeAborters.TryRemove(user.UserId, out var aborter);

Creation of Channel can have potential leaks:

taskQueue = Channel.CreateBounded<AsyncWorkItem<T>>(new BoundedChannelOptions(1)

You picked option FullMode = BoundedChannelFullMode.DropOldest which should remove oldest values if there are any, so I assume that that stops queued items from processing as they would not be read. It's a hypotheses, but I assume that if an old item is removed without being handled, then processUserStateUpdateAsync won't get called and all resources won't be freed.

You can start with these found issues and it should be easier to find the real cause after that.

edited May 15 '20 at 20:03

answered May 15 '20 at 17:07

cassandrad

3,412
26
50

About the channel - it's one of several implementations I've done. I realize it drops data, but, given that every incoming status gets its own `CancellationTokenSource` and that on the previous source `Cancel` is called, the items that the Channel drops still get cancelled. As for the `userChangeAborters.TryRemove` the creation of a new `CancellationTokenSource` is wrapped into an `using` statement, shouldn't that automatically dispose when it leaves the context of that `using`? I've actually tried Disposing after `TryRemove` but it had no effect. – Stephan Steiner May 18 '20 at 16:20
About the `GetOrAdd`. You're right on the money, I'm providing the value, I should be using the valueFunc instead. And I did leave the thread safety bit out for brevity (normally the value func should be returning `Lazy` and I'm making sure no concurrent `AddOrUpdate` generates duplicate `CancellationTokenSources`. Let me work that in and let's see where we land. – Stephan Steiner May 18 '20 at 17:03
1

Using valueFuncs for `GetOrAdd` and `AddOrUpdate` indeed massively improved the situation. I'll be checking in an updated version shortly. – Stephan Steiner May 18 '20 at 17:27
Just pushed an update containing all your suggestions. I've also added all the code to ensure that the valueFactories don't have any concurrency errors (the locks may be overkill). Next thing I'm going to crank up the complexity again (one user status can trigger various followup actions which themselves can be highly concurrent) and then see what's what. A 40 Minute test with 10 producers yielded a much more stable memory profile (not fully flat though, so I'm not sure if I've reached my goal yet). – Stephan Steiner May 18 '20 at 18:27
@StephanSteiner if `aborter` was in `using`, then I just missed that part and everything is fine. From my point of view, investing in optimization of current code is pointless, even after optimizations you've made. Two reasons: 1) code didn't get simpler, you still have the complexity; 2) you still isn't sure there are no leaks and had to use additional synchronization, which makes code slower. These two are signs that chosen approach didn't work out at all. Just try to write everything anew, aiming simplicity of solution in the beginning, using as less objects as you can. Then optimize. – cassandrad May 18 '20 at 20:17
1

While testing I made some further improvements - no more delayed disposes which means I'm not kicking off tasks left and right anymore. And that lead to a stable memory usage. Your tip about `GetOrAdd` was right on the money and sent me down the right path. I will rethink the `CancellationTokenSources` down the road as `Cancel`on these almost never actually abort something where it matters. So, perhaps using a more lightweight abort mechanism would be in order. – Stephan Steiner May 21 '20 at 19:44
@StephanSteiner can you mark my reply as an answer than? :) – cassandrad May 22 '20 at 07:44

C# Abortable Asynchronous Fifo Queue - leaking massive amounts of memory

1 Answers1