I have used the BlockingCollection many times for implementing the producer/consumer pattern, but I have experienced bad performance with extremely granular data, because of the associated overhead. This usually forces me to improvise by chunkifying/partitioning my data, in other words by using a BlockingCollection<T[]> instead of a BlockingCollection<T>. Here is a recent example. This works, but it's ugly and error-prone: I end up with nested loops at both the producer and the consumer, and I must remember to Add whatever is left over at the end of a producer's workload. So I had the idea of implementing a chunky BlockingCollection that handles all these complications internally, and exposes the same simple interface as the existing BlockingCollection. My problem is that I haven't yet managed to match the performance of the complex manual partitioning. My best attempt still pays a performance tax of around +100% for extremely granular data (basically just integer values). So I would like to present here what I have done so far, hoping for advice that will help me close the performance gap.
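For context, the manual partitioning that I am trying to encapsulate looks roughly like this (a minimal sketch with hypothetical names; Process stands in for whatever per-item work the consumer does):

var collection = new BlockingCollection<int[]>();
const int chunkSize = 200;

var producer = Task.Run(() =>
{
    // Producer: accumulate items in a buffer, and add full chunks only.
    var buffer = new List<int>(chunkSize);
    for (int i = 0; i < 10_000_000; i++)
    {
        buffer.Add(i);
        if (buffer.Count == chunkSize)
        {
            collection.Add(buffer.ToArray());
            buffer.Clear();
        }
    }
    // Easy to forget: the last, partially filled chunk.
    if (buffer.Count > 0) collection.Add(buffer.ToArray());
    collection.CompleteAdding();
});

// Consumer: nested loops to unwrap each chunk.
foreach (var chunk in collection.GetConsumingEnumerable())
    foreach (var item in chunk)
        Process(item);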

My best attempt is using a ThreadLocal<List<T>>, so that each thread works on a dedicated chunk, removing any need for locks.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class ChunkyBlockingCollection1<T>
{
    private readonly BlockingCollection<T[]> _blockingCollection;
    private readonly int _chunkSize;
    private readonly ThreadLocal<List<T>> _chunk;

    public ChunkyBlockingCollection1(int chunkSize)
    {
        _blockingCollection = new BlockingCollection<T[]>();
        _chunkSize = chunkSize;
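        // trackAllValues: true, so that CompleteAdding can enumerate the chunk of every producer thread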
        _chunk = new ThreadLocal<List<T>>(() => new List<T>(chunkSize), true);
    }

    public void Add(T item)
    {
        var chunk = _chunk.Value;
        chunk.Add(item);
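        // Flush the thread-local chunk to the underlying collection when it fills up.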
        if (chunk.Count >= _chunkSize)
        {
            _blockingCollection.Add(chunk.ToArray());
            chunk.Clear();
        }
    }

    public void CompleteAdding()
    {
        // Flush the leftover (partially filled) chunk of every producer thread.
        foreach (var chunk in _chunk.Values)
        {
            if (chunk.Count > 0)
            {
                _blockingCollection.Add(chunk.ToArray());
                chunk.Clear();
            }
        }
        _blockingCollection.CompleteAdding();
    }

    public IEnumerable<T> GetConsumingEnumerable()
    {
        foreach (var chunk in _blockingCollection.GetConsumingEnumerable())
        {
            for (int i = 0; i < chunk.Length; i++)
            {
                yield return chunk[i];
            }
        }
    }
}

My second best attempt is using a single List<T> as the chunk, accessed by all threads in a thread-safe manner using a lock. Surprisingly, this is only slightly slower than the ThreadLocal<List<T>> solution.

public class ChunkyBlockingCollection2<T>
{
    private readonly BlockingCollection<T[]> _blockingCollection;
    private readonly int _chunkSize;
    private readonly List<T> _chunk;
    private readonly object _locker = new object();

    public ChunkyBlockingCollection2(int chunkSize)
    {
        _blockingCollection = new BlockingCollection<T[]>();
        _chunkSize = chunkSize;
        _chunk = new List<T>(chunkSize);
    }

    public void Add(T item)
    {
        lock (_locker)
        {
            _chunk.Add(item);
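            // Flush the shared chunk to the underlying collection when it fills up.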
            if (_chunk.Count >= _chunkSize)
            {
                _blockingCollection.Add(_chunk.ToArray());
                _chunk.Clear();
            }
        }
    }

    public void CompleteAdding()
    {
        lock (_locker)
        {
            // Flush the leftover (partially filled) chunk, if any.
            if (_chunk.Count > 0)
            {
                _blockingCollection.Add(_chunk.ToArray());
                _chunk.Clear();
            }
        }
        _blockingCollection.CompleteAdding();
    }

    public IEnumerable<T> GetConsumingEnumerable()
    {
        foreach (var chunk in _blockingCollection.GetConsumingEnumerable())
        {
            for (int i = 0; i < chunk.Length; i++)
            {
                yield return chunk[i];
            }
        }
    }
}

I have also tried using a ConcurrentBag<T> as the chunk, which resulted in bad performance and a correctness issue (because I didn't use a lock). Another attempt was replacing the lock (_locker) with a SpinLock, with even worse performance. The locking is clearly the root of my problems, because if I remove it completely my class obtains optimal performance. Of course removing the lock fails miserably with more than one producer.
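For reference, the SpinLock attempt replaced the lock in the Add method with something roughly like this (a sketch; note that SpinLock is a mutable struct, so the field must not be declared readonly):

private SpinLock _spinLock = new SpinLock(enableThreadOwnerTracking: false);

public void Add(T item)
{
    bool lockTaken = false;
    try
    {
        _spinLock.Enter(ref lockTaken);
        _chunk.Add(item);
        if (_chunk.Count >= _chunkSize)
        {
            _blockingCollection.Add(_chunk.ToArray());
            _chunk.Clear();
        }
    }
    finally
    {
        if (lockTaken) _spinLock.Exit();
    }
}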

Theodor Zoulias
  • Starting the title with *I need ideas* will attract the wrong type of attention. – Nkosi Jul 06 '19 at 03:46
  • @Nkosi it seems that my question attracted no attention nonetheless. – Theodor Zoulias Jul 07 '19 at 06:01
  • @TheodorZoulias, did you try to use `BlockingCollection[]`? – Dmitry Stepanov Jul 07 '19 at 20:42
  • @DmitryStepanov no I haven't. How would this work? – Theodor Zoulias Jul 07 '19 at 20:48
  • @TheodorZoulias, have a look at this (https://learn.microsoft.com/en-us/dotnet/standard/collections/thread-safe/how-to-use-arrays-of-blockingcollections) – Dmitry Stepanov Jul 08 '19 at 07:38
  • @DmitryStepanov the example in Microsoft's article is quite complex. I am not sure that using multiple `BlockingCollection`s will help me improve the performance of my implementation. Currently the throughput of the `ChunkyBlockingCollection1` class is around 10,000,000 items per second, using chunks of 200 items each (in a setup of 1 producer and 1 consumer). Doing the chunking manually beats my class, doing the same job in less than half a second. The focus of my implementation is on extremely granular data, where there is barely any processing at all for each item. – Theodor Zoulias Jul 08 '19 at 10:56
  • First, I'd suggest reading http://www.albahari.com/threading/part5.aspx#_Concurrent_Collections and the section below that, and then see if you can find a way to tell whether your threads are doing work most of the time, or spending a lot of time blocking. If they're mostly waiting on the producer, that's a different problem than if the queue is filling up and the producer is waiting for the consumers, or if it's pure lock/block overhead. 10,000,000/s is outside of what a lot of "normal wisdom" deals with, so be careful. – Kevin Anderson Jul 09 '19 at 13:50
  • Hi @KevinAnderson, I have read Joseph's book quite recently. It was a good read! My two attempts have different characteristics. `ChunkyBlockingCollection1` doesn't use locks, so there is no contention and it scales excellently, at the cost of not being completely thread safe. It relies on the callers for thread safety. `ChunkyBlockingCollection2` uses a shared chunk and locks on `Add` and `CompleteAdding`, so it is 100% thread safe, at the cost of a lot of contention when there is more than one producer. – Theodor Zoulias Jul 09 '19 at 23:50

1 Answer

You can try using an array for _chunk instead of a List<T>. Then you can use Interlocked.Increment on Add to reserve the next index to populate, and when the count exceeds the size of the chunk, move it all to the blocking collection and reset the index (in a lock, of course).
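Here is a minimal sketch of what this could look like (hypothetical code, not tested against your benchmark; it incorporates the spin-wait refinement from the comments below by counting committed writes separately from reserved slots, so that a chunk is never flushed while a write is still in flight; CompleteAdding and GetConsumingEnumerable are omitted for brevity):

public class ChunkyBlockingCollection3<T>
{
    private readonly BlockingCollection<T[]> _blockingCollection
        = new BlockingCollection<T[]>();
    private readonly int _chunkSize;
    private readonly object _locker = new object();
    private T[] _chunk;
    private int _reserved;  // slots handed out to writers
    private int _committed; // slots actually written

    public ChunkyBlockingCollection3(int chunkSize)
    {
        _chunkSize = chunkSize;
        _chunk = new T[chunkSize];
    }

    public void Add(T item)
    {
        while (true)
        {
            int index = Interlocked.Increment(ref _reserved) - 1;
            if (index < _chunkSize)
            {
                _chunk[index] = item; // lock-free fast path
                Interlocked.Increment(ref _committed);
                return;
            }
            // The chunk is full. One thread flushes it under the lock;
            // the others find it already reset and simply retry.
            lock (_locker)
            {
                if (Volatile.Read(ref _reserved) > _chunkSize)
                {
                    // Spin until all reserved slots have actually been written.
                    var spinner = new SpinWait();
                    while (Volatile.Read(ref _committed) < _chunkSize)
                        spinner.SpinOnce();
                    _blockingCollection.Add(_chunk);
                    _chunk = new T[_chunkSize];
                    Volatile.Write(ref _committed, 0);
                    Volatile.Write(ref _reserved, 0); // reopen the chunk last
                }
            }
        }
    }
}

This keeps the hot path down to one Interlocked.Increment and an array store; the lock is taken only once per chunk instead of once per item.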

Nick
  • It would be very nice if I could get rid of the `lock` in the `Add` method, but I don't think it is possible. Without the `lock` I risk pushing an array to the `BlockingCollection` before all its elements have been filled with data, resulting in a `NullReferenceException` when it is processed by the consumer thread. – Theodor Zoulias Jul 09 '19 at 16:21
  • @TheodorZoulias, you have a point. To eliminate the race condition: if Interlocked.Increment returns a value that exceeds the size of the chunk, you can spin-wait (using Interlocked.CompareExchange) until the operation that moves the chunk to the BlockingCollection and resets the current index has completed. – Nick Jul 10 '19 at 10:49
  • This is worth a try! – Theodor Zoulias Jul 10 '19 at 11:05