1

I am currently trying to optimize an old and very poorly written class which processes a lot of data and thus can easily take multiple hours to run through a set of data. Collecting the data already takes a lot of time and this is what I try to improve here. I know this is rather smelly code, but it's just a test if this even improves anything, so please just focus on the issue:

I tried SemaphoreSlim and Semaphore in order to reduce the amount of tasks running concurrently. My data set would generate about 70 tasks which would probably result in thread starvation and overall performance degradation. At least it turned out to become a lot less responsive. So I tried to keep it at 5 tasks at the same time for better overall throughput.

Now when I try to wait for my task to enter the sempahore it blocks (slim semphore using await also blocks) but it never enters even though the semaphore isn't full. This code is inside an async method as slight contextual hint.

Semaphore throttle = new Semaphore(0, 5);

try
{
    foreach (var folder in folders)
    {
        // Wait in case there are already 5 tasks running to reduce thread starvation
        collectionTasks.Add(Task.Run( () =>
        {
            // ReSharper disable once AccessToDisposedClosure
            throttle.WaitOne();
            return GetGapProfiles(folder.Value, progress, token);
        }, token).ContinueWith(
            t =>
            {
                // ReSharper disable once AccessToDisposedClosure
                throttle.Release();
                return t.Result;
            }, TaskContinuationOptions.None));
    }

    // When all are loaded concat all results into one collection
    await Task.WhenAll(collectionTasks);
}
catch (Exception ex)
{
    Log.Error(ex, "Failed to collect profiles.");
}
finally
{
    throttle.Dispose();
}

I just don't understand why this blocks and never ever enters GetGapProfiles. Can anyone explain this?

SharpShade
  • 1,761
  • 2
  • 32
  • 45
  • 6
    A semaphore is blocked when it's empty, not when it's full. Yours starts out empty and never gets anything to call `Release()` on it – Biesi Sep 16 '20 at 12:37
  • 1
    By the way, you might want to use `SemaphoreSlim` rather than `Semaphore`: https://stackoverflow.com/a/29390802/3181933 If you do, I'd suggest using `WaitAsync` rather than `Wait`. – ProgrammingLlama Sep 16 '20 at 12:42
  • Okay I misunderstood the example on MSDN. It's initialized with 0, 3 but they release 3 before thei await the tasks. That I overlooked and I had in mind it worked the other way around. My bad, sorry. @John I will do so, I just tried Semaphore if slim was the issue, just to eliminate all cases. – SharpShade Sep 16 '20 at 12:42
  • 2
    By the way, you may want to use Parallel.ForEach or DataFlow, where you can explicitly configure the desired degree of parallelism. – Fildor Sep 16 '20 at 12:42
  • 1
    Ah, wait ... is `GetGapProfiles` returning a Task? – Fildor Sep 16 '20 at 12:53
  • 1
    ^^ and is it I/O-bound? – Fildor Sep 16 '20 at 13:05
  • 2
    One thing to keep in mind is Tasks are not threads. Starting a task does not start a new thread. Tasks are queued and the associated task scheduler manages actually executing the tasks. So I'm not sure you're overall goal is being meet by trying to manually manage the execution of the tasks. – MikeJ Sep 16 '20 at 13:23
  • @Fildor Yes, that method is an async one and returns a task. And yes it's heavily I/O bound. Well I know Tasks are no threads. They don't need to be. It's fine if .NET manages this on it's own, the I/O just has to be done separately and somewhat parallel. I implemented it with `Parallel.ForEach` now and it didn't really change anything but made it more readable. – SharpShade Sep 18 '20 at 10:25
  • 1
    In that case, DataFlow is probably the way to go. You can stay async _and_ configure degree of parallelism. Parallel.ForEach and async doesn't mix so well. – Fildor Sep 18 '20 at 11:06
  • Thanks @Fildor I'll look into this. – SharpShade Sep 21 '20 at 12:24

1 Answers1

-2
public static class perTaskThrottle
{
    /// <summary>
    /// Run multiple tasks in parallel - up to concurrentTasks tasks may run at any one time
    /// </summary>
    /// <typeparam name="TInput"></typeparam>
    /// <typeparam name="TResult"></typeparam>
    /// <param name="sourceItems"></param>
    /// <param name="func"></param>
    /// <param name="concurrentTasks"></param>
    /// <returns></returns>
    public static Task<IDictionary<TInput, TResult>> ForEachAsyncThrottled<TInput, TResult>(
        this IEnumerable<TInput> sourceItems,
        Func<TInput, Task<TResult>> func,
        int concurrentTasks = 1)
    {
        return ForEachAsyncThrottled(sourceItems, func, CancellationToken.None, concurrentTasks);
    }

    /// <summary>
    /// Run multiple tasks in parallel - up to concurrentTasks tasks may run at any one time
    /// </summary>
    /// <typeparam name="TInput"></typeparam>
    /// <typeparam name="TResult"></typeparam>
    /// <param name="sourceItems"></param>
    /// <param name="func"></param>
    /// <param name="token"></param>
    /// <param name="concurrentTasks"></param>
    /// <returns></returns>
    public static async Task<IDictionary<TInput, TResult>> ForEachAsyncThrottled<TInput, TResult>(
        this IEnumerable<TInput> sourceItems,
        Func<TInput, Task<TResult>> func,
        CancellationToken token,
        int concurrentTasks = 1)
    {
        var result = new ConcurrentDictionary<TInput, TResult>();

        var tasksList = new List<Task>();
        using (var semaphoreSlim = new SemaphoreSlim(concurrentTasks))
        {
            foreach (var item in sourceItems)
            {
                token.ThrowIfCancellationRequested();

                // if there are already concurrentTasks tasks executing, pause until one has completed ( semaphoreSlim.Release() )
                await semaphoreSlim.WaitAsync(perTimeSpanHelper.Forever, token).ConfigureAwait(false);

                token.ThrowIfCancellationRequested();

                Action<Task<TResult>> okContinuation = async task =>
                {
                    // the task has already completed if status is CompletedOk, but using await once more is safer than using task.Result
                    var taskResult = await task;
                    result[item] = taskResult;
                };

                // ReSharper disable once AccessToDisposedClosure
                Action<Task> allContinuation = task => semaphoreSlim.Release();

                tasksList.Add(func.Invoke(item)
                    .ContinueWith(okContinuation, TaskContinuationOptions.OnlyOnRanToCompletion)
                    .ContinueWith(allContinuation, token));

                token.ThrowIfCancellationRequested();
            }

            if (!token.IsCancellationRequested)
            {
                await Task.WhenAll(tasksList).ConfigureAwait(false);
            }
        }

        return result;
    }
}

So in your case you could use

var results = folders.ForEachAsyncThrottled( (f) => GetGapProfiles(f.Value), token, 5);
Peregrine
  • 4,287
  • 3
  • 17
  • 34
  • 2
    There are many problems with this code. 1) Assumes that the source `IEnumerable` contains unique elements. 2) Combines `ContinueWith` with `await`. 3) Race condition with the `token`. 4) In case of cancellation forgets the currently running tasks and ignores their errors. 5) Ignores all errors anyway. I suggest to just use an [`ActionBlock`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.actionblock-1) from the TPL Dataflow library, instead of trying to replicate its functionality. – Theodor Zoulias Sep 16 '20 at 19:14