
The requirement is: items to be processed are stored in a global queue. Several handler threads take items from the global queue to handle. The producer thread adds items to the global queue continuously and rapidly (much faster than all the handler threads' combined processing speed; also, the handler threads are compute-intensive, so the best performance is when the CPU is fully used). So I use one more countKeeping thread to keep the queue's length within a specific range, roughly from BOTTOM to TOP (just to keep the queue from using too much memory).

I use a ManualResetEvent to handle the 'can add to queue' status change. The global queue is

Queue<Object> mQueue = new Queue<Object>();
ManualResetEvent waitingKeeper = new ManualResetEvent(false);  

The handler thread is

void Handle()
{
    while(true)
    {
        Object item = null;
        lock(mQueue)
        {
            if(mQueue.Count > 0)
                item = mQueue.Dequeue();
        }
        if(item == null)
            continue; // queue was momentarily empty, check again
        // deal with item, compute-intensive
    }
}

The producer thread calls the AddToQueue() function to add an item to mQueue.

void AddToQueue(Object item)
{
    waitingKeeper.WaitOne();
    lock(mQueue)
    {
        mQueue.Enqueue(item);
    }
}

The countKeeping thread is mainly like the following

void KeepQueueingCount()
{
    while(true)
    {
        // no 'lock(mQueue)' here, because I don't need
        // the exact count of the queue; I just need the
        // queue not to use too much memory
        if(mQueue.Count < BOTTOM)
            waitingKeeper.Set();
        else if(mQueue.Count > TOP)
            waitingKeeper.Reset();
        Thread.Sleep(1000);
    }
}

Here the problem comes.
When I set BOTTOM and TOP to smaller numbers, like BOTTOM = 20, TOP = 100, it works well with a quad-core CPU (CPU utilization is high), but not so well with a single CPU (CPU utilization fluctuates considerably).
When I set BOTTOM and TOP to larger numbers, like BOTTOM = 100, TOP = 300, it works well with a single CPU, but not so well with a quad-core CPU.
In both environments, under both settings, memory usage is not excessive (around 50 MB at most).

Logically, a larger BOTTOM and TOP should help performance (as long as memory usage stays acceptable), because there are more items available for the handler threads to handle. But the facts seem to say otherwise.

I tried several ways to find the cause of the problem, and I found that when I use lock(mQueue) in the keeping thread, it works well under both CPU conditions.
The new countKeeping thread is mainly like this

void KeepQueueingCount()
{
    bool canAdd = false;
    while(true)
    {
        lock(mQueue)
        {
            if(mQueue.Count < BOTTOM)
                canAdd = true;
            else if(mQueue.Count > TOP)
                canAdd = false;
        }
        if(canAdd)
            waitingKeeper.Set();
        else
            waitingKeeper.Reset();
        // I also made a change here:
        // when the 'can add' status changes, sleep longer;
        // when it does not change, sleep less.
        // But this is not the main reason.
        Thread.Sleep(1000);
    }
}

So my questions are:

  1. When I don't use a lock in the countKeeping thread, why does the range of the global queue affect performance (here, mainly CPU utilization) differently under different CPU conditions?
  2. When I use a lock in the countKeeping thread, performance is good under both conditions. What does the lock really do to affect this?
  3. Is there a better way to change the 'can add' status than using a ManualResetEvent?
  4. Is there a better model for my requirement? Or a better way to keep memory from being used too much when the producer thread works continuously and rapidly?

---UPDATE---
The producer thread's main part is as follows. STEP is the number of items fetched per database query. Queries run successively and in order until all items have been queried.

void Produce()
{
    while(true)
    {
        // query STEP items from the database
        var itemList = QuerySTEPFromDB();
        if(itemList.Count == 0)
        {
            // stop all handler threads:
            // wait for the handler threads to handle all items in the queue,
            // then all handler threads exit
            break;
        }
        else
        {
            foreach(var item in itemList)
                AddToQueue(item);
        }
    }
}
  • When you say _continuously_ do you mean _continually_ or _constantly_? The word _continuously_ doesn't make sense for discrete actions. – Enigmativity May 19 '15 at 04:09
  • How is `AddToQueue` being called? Where's the code for that? – Enigmativity May 19 '15 at 04:15
  • Have you had a look at using [`BlockingCollection`](https://msdn.microsoft.com/en-us/library/dd997371%28v=vs.110%29.aspx) with an underlying `ConcurrentQueue` as your queue? (A minimal sketch of this approach follows these comments.) – Alex May 19 '15 at 04:16
  • Can you explain why you need to do this? What's the problem you're trying to solve? – Enigmativity May 19 '15 at 04:28
  • `AddToQueue` is called in the **producer thread**. In the **producer thread**, there is roughly a `while` loop: query items from a large database and add them to the global queue. Several **handler thread**s take items from the global queue, do some computation, then write the results to the database. @Enigmativity – Wentworth X May 19 '15 at 05:20
  • How are you querying the database? And what is the logic if your queue is full? Just skip values from the database? Delay them? – Enigmativity May 19 '15 at 05:36
  • I updated my question and put the main part of the **producer thread** in it. As for what happens if the global queue is full: actually, this won't happen, because of the **countKeeping thread**. The **countKeeping thread** keeps the queue in a specific range: from ***BOTTOM*** to ***TOP***. @Enigmativity – Wentworth X May 19 '15 at 06:57
  • @XiaoF - That's what I meant by when the queue is full. In other words, what does the producer do when queue hits TOP? – Enigmativity May 19 '15 at 07:51
  • @Enigmativity - When queue hits TOP, **producer thread** will wait at the `AddToQueue`, until ManualResetEvent *waitingKeeper* is set in **countKeeping thread**. – Wentworth X May 19 '15 at 07:56
  • @XiaoF - But it appears that your code tries to enqueue everything even with the `waitingKeeper` in `KeepQueueingCount`. So if the query has 1,000 items to add, it could wait in `KeepQueueingCount` 1,000 times before executing the query again. What does the query return? – Enigmativity May 19 '15 at 07:59
  • @Alex - Seems `BlockingCollection` is much more convenient, but it requires Framework 4 or higher. My project is based on Framework 3.5. – Wentworth X May 19 '15 at 08:08
  • @XiaoF, the source code is available though from [microsoft's reference source](http://referencesource.microsoft.com/#System/sys/system/collections/concurrent/BlockingCollection.cs,0c84922eea8bf511). It is not exactly small, but may provide you with some inspiration and implementation ideas and may be stripped to a more bare bones implementation. – Alex May 19 '15 at 08:11
  • @Alex - Thank you so much for this link! I will study it particularly. – Wentworth X May 19 '15 at 08:40
  • @Enigmativity - The **CountKeeping thread** runs the `KeepQueueingCount` function. It checks the queue's count every second. The **producer thread** runs the `Produce` function. If BOTTOM = 100, TOP = 200, STEP = 1000, maybe in the first second 90 items are added to the queue. At this time, `waitingKeeper` is still set, and the queue can still be added to. Then the next second, maybe 180 items are in the queue, `waitingKeeper` gets reset by the **CountKeeping thread**, and no items can be added to the queue. At this time, the **producer thread** waits, and the **handler thread**s work. Until maybe the next second, only 80 items are in the queue, and `waitingKeeper` is set again. – Wentworth X May 19 '15 at 08:42
  • @XiaoF - So it returns fewer records from the query depending on how many it is adding to the queue? It's effectively dropping excess records? – Enigmativity May 19 '15 at 08:48
  • @Enigmativity - Each query operation returns a specific number of records, and that number is STEP. The records are in memory, but not in the queue. Only while the queue's count stays roughly between BOTTOM and TOP can they be added to the queue one by one. If the queue's count is larger than TOP, the **producer thread** won't query records from the database into memory, and won't add records from memory to the queue. – Wentworth X May 19 '15 at 08:56
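
For reference, here is a minimal sketch of the `BlockingCollection` approach suggested in the comments above. It assumes .NET 4+ (which this project lacks), and `QuerySTEPFromDB`/`Process` are placeholders for the real query and compute steps:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class Pipeline
{
    // Bounded to TOP (here 300) items: Add() blocks once the bound is
    // reached, which replaces both the countKeeping thread and the
    // ManualResetEvent.
    readonly BlockingCollection<Object> mQueue =
        new BlockingCollection<Object>(new ConcurrentQueue<Object>(), 300);

    void Produce()
    {
        while(true)
        {
            List<Object> itemList = QuerySTEPFromDB(); // placeholder query
            if(itemList.Count == 0)
            {
                mQueue.CompleteAdding(); // tell handlers no more items are coming
                break;
            }
            foreach(var item in itemList)
                mQueue.Add(item); // blocks while the queue is at capacity
        }
    }

    void Handle()
    {
        // Blocks while the queue is empty; the loop ends once
        // CompleteAdding has been called and the queue drains.
        foreach(var item in mQueue.GetConsumingEnumerable())
            Process(item); // placeholder for the compute-intensive work
    }

    List<Object> QuerySTEPFromDB() { return new List<Object>(); }
    void Process(Object item) { }
}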

1 Answer


Your concurrent queue is a classic case where an atomic compare-and-swap (CAS) spinlock tends to do considerably better, given very high contention but very little time spent inside the lock (just the time to enqueue and dequeue).

https://msdn.microsoft.com/en-us/library/dd460716%28v=vs.110%29.aspx

It's also worth noting that .NET already provides a concurrent queue, `ConcurrentQueue<T>`, which uses that kind of atomic CAS design.
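
As a minimal sketch (again assuming .NET 4+, which your 3.5 project lacks), the handler loop on top of `ConcurrentQueue` needs no explicit lock at all:

using System;
using System.Collections.Concurrent;
using System.Threading;

class Handlers
{
    // Enqueue/TryDequeue are implemented internally with Interlocked
    // CAS operations rather than a Monitor lock.
    static readonly ConcurrentQueue<Object> mQueue = new ConcurrentQueue<Object>();

    static void Handle()
    {
        while(true)
        {
            Object item;
            if(mQueue.TryDequeue(out item))
            {
                // deal with item, compute-intensive
            }
            else
            {
                Thread.Sleep(0); // yield while the queue is momentarily empty
            }
        }
    }
}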

Higher-level locks get very expensive if you have a very highly-contended shared resource that is only used for a brief amount of time.

If I use a crude visual analogy (with exaggerated, human-level time units): imagine you run a store and there's a line, but the clerks are really fast at their job, so the line moves every second.

If you use a critical section/mutex here, it's like every customer dozing off for a nap when they find it's not their turn yet. Then when it's their turn, someone has to wake them up: "Hey you, it's your turn now! Wake up!" "Wha--huh? Oh okay." As you can imagine, because of the additional time spent blocking and suspending threads, the lines of threads waiting their turn also tend to grow bigger and bigger.

This is also why you're seeing fluctuations in CPU utilization. Threads can form a traffic jam around the lock and get suspended/put to sleep, and that takes away from CPU utilization while they sleep and wait for their turn. It's also rather nondeterministic, since multithreaded code does not execute in a perfectly predefined order, so you can see spikes whenever your design lets threads pile up around locks. You might get lucky in one session and get fast performance in such time-sensitive cases, then get unlucky in another and see very poor performance. In worst-case scenarios, excessive locking can actually push CPU utilization below that of single-threading. (I once worked in a codebase where the devs had a habit of putting mutexes around everything, and I was often looking at 10% CPU utilization in performance-critical areas on a 2-core machine. It was a legacy codebase where multithreading was an afterthought, with locks sprinkled everywhere instead of code designed for multithreading.)

If you use a low-level spinlock here, it's like the customers not dozing off when they find there's a line. They just stand around, waiting very impatiently and constantly checking whether it's their turn. If the line is moving really fast, this can work much better.

Your problem is somewhat unusual in that your producers produce much faster than the consumers consume. The upper limit on how much can be queued at once seems sensible here, though it does throttle the producers. I'm also not sure why you solve that in a separate count-keeper thread (I didn't quite understand that part). I think you can simply make the producers stop enqueuing items once some upper limit is reached, until the queue shrinks again.

You might want to keep that upper limit to avoid spamming memory, but I think you'll do better (after switching to the appropriate lock) to sleep or yield the producers whenever they enqueue an item, to skew the processing distribution towards your consumers. That way your producers don't jam up when they hit the limit; the point is to avoid reaching the limit at all, so your consumers can consume at a rate that doesn't lag significantly behind the producers.
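
A rough sketch of that producer-side throttle, using the question's existing `Queue` and a plain `lock` (a spinlock would drop in the same way; the 50 ms sleep is an illustrative tuning knob, not a measured value):

void AddToQueue(Object item)
{
    while(true)
    {
        lock(mQueue)
        {
            // Check the size and enqueue under the same lock, so the
            // count we test is the count we act on.
            if(mQueue.Count < TOP)
            {
                mQueue.Enqueue(item);
                return;
            }
        }
        // Queue is full: release the lock and give the handlers time
        // to drain it before trying again.
        Thread.Sleep(50);
    }
}

With this in place, the countKeeping thread and the ManualResetEvent can be removed entirely.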

  • Thank you for the answer! The performance problem occurred only when I didn't use any kind of lock in the **CountKeeping thread**, so I guess the main issue is "with or without a lock" rather than "with which kind of lock". As for the **CountKeeping thread**, I just use it to keep the queue's count in a range. You said this can be done by controlling my producer. Can you be more specific? How can the producer know that the queue's count is above the top of the range? I can only think of judging whether each call to `AddToQueue` costs too much. Do you have a better solution? :) – Wentworth X May 19 '15 at 09:15
  • It's worth noting that `ManualResetEvent` is still a kind of lock object even though it uses signals. It still blocks threads waiting for the signal. It should actually be a bit more expensive than a lock, since it requires that interthread communication to unblock threads, making it a resource and a lock at the same time. I think you could potentially get some gains and a simpler result by simply having your producers check the size of the global queue inside a spinlock. If it's too big, unlock, sleep, and try again a little later... –  May 19 '15 at 09:23
  • ... This way, your consumers don't end up lagging behind too much, and the sleeps in the producers will yield more time for the consumers (handlers) to do their thing. –  May 19 '15 at 09:24
  • Like you said, the **CountKeeping thread** can be removed and the producer can do its job. One more question, on the "with or without lock" concern: the queue changes very frequently. When I read the queue's count to judge whether it has reached TOP, reading the count directly without a SpinLock seems slower than reading it with one. For a frequently changing collection, is reading a property under a lock better than without one? – Wentworth X May 19 '15 at 10:04
  • Without a lock, you've got a race condition if one thread accesses the size of the queue while another is simultaneously changing it (unless you use `ConcurrentQueue`, which already does the synchronization for you). So lockless should be faster, but it might be slower because it's incorrect and doing something weird as a result of the race condition. Here you definitely need the lock for correct behavior, and a spinlock tends to fit your needs (usually spinlocks are kind of bad, but for your specific case, they tend to fit). –  May 19 '15 at 10:07
  • NP -- normally you only want as many locks as there are shared resources being concurrently accessed. Here you just have one: the global queue, so you only need one lock. It's also worth noting that locks are sometimes implicit, like with `ManualResetEvent` which is a lock+communication object, even though it doesn't blatantly look like a lock. You usually don't need to go any further beyond this point except in rare-case scenarios where you have very high contention to a resource which is used only for a brief period of time in the lock. There, spinlocks can actually be your friend. –  May 19 '15 at 10:13
  • If you want to start scraping metal with performance, then you want to try to start splitting highly contended shared resources apart if you can so that they aren't shared (or not shared as much). For example, you might have each producer output to its own thread-local queue, wait-free, and just transfer it periodically. That's useful if you want to speed up your producers who are trailing behind your consumers. You seem to have it the other way around, so you might actually want the producers to slow down a bit and take it easy with some sleeps in there so that your consumers have more time. –  May 19 '15 at 10:19
  • My project is based on Framework 3.5, so I think I can use `Monitor` to do this. – Wentworth X May 19 '15 at 10:28
  • Oh my bad, I wasn't aware that you lacked SpinLock there. Here's a super featherweight implementation of a 3.5-compatible SpinLock that provides a superset of the SpinLock interface found in later .NET versions (so it'll be easy to port your code if you ever want to). It may be of interest, as it shows how to make your own SpinLock using atomic CAS (aka Interlocked Compare/Exchange); see the bare-bones sketch after these comments. https://code.google.com/p/accord/source/browse/trunk/Sources/Accord.Core/Compatibility/SpinLock.cs?r=366 –  May 19 '15 at 10:37
  • If you're really interested in implementing very low-level core-type functionality like your own concurrent queue, memory allocators, etc. in a wait-free way, then the atomic compare/swap or interlocked compare/exchange is one of the most important tools there. This is somewhat delicate coding and not always encouraged for more daily usage, since it's a bit fragile and easy to make mistakes leading to horrible race condition bugs. But if you're going for max performance and use them with discretion, then they can be useful tools in your arsenal. –  May 19 '15 at 10:41
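
For reference, the core of such a 3.5-compatible spinlock is tiny. The following is a bare-bones illustration of the Interlocked CAS idea (not the full linked implementation, and not production-hardened):

using System.Threading;

class SimpleSpinLock
{
    int taken; // 0 = free, 1 = held

    public void Enter()
    {
        // Atomically swap 0 -> 1; spin while another thread holds the lock.
        while(Interlocked.CompareExchange(ref taken, 1, 0) != 0)
            Thread.SpinWait(1); // brief pause between attempts
    }

    public void Exit()
    {
        // An interlocked write releases the lock with a full fence.
        Interlocked.Exchange(ref taken, 0);
    }
}

Enter()/Exit() would then bracket the Count check and the Enqueue/Dequeue calls in place of lock(mQueue).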