1

I am having stability problems with OpenCMS. When i do a thread dump, many threads (400) are waiting on the synchronized (m_processingFiles) block in the following code:

public class CmsJspLoader ... {
...

private static Set m_processingFiles = Collections.synchronizedSet(new HashSet());

...
...
...
public String updateJsp(...) {
....

while (m_processingFiles.contains(jspVfsName)) {
    // wait a little bit until the first thread finishes
    try {
        synchronized (m_processingFiles) {
            m_processingFiles.wait(100);
        }
    } catch (InterruptedException e) {
        // ignore
    }
}
...
}
...
}

The code is part of OpenCMS. There is no notify() anywhere in the code. There is no change of state or reading shared variable inside the shown sync block. However, there are 400 threads waiting on it, that means, just to pass through this sync the last one should wait 40sec!!!

I simply do not understand the purpose of it. Is there something i am not seeing?

Op De Cirkel
  • 28,647
  • 6
  • 40
  • 53
  • Is there any other place where `m_processingFiles` being synchronized again ? – Chris Sep 15 '12 at 06:46
  • Well, processingFiles is `synced Set`, that means all the operation like `contains` in the code are `synced`, and yes (to your question) there are some other manipulations of the set, but the code i posted simply does not make any sense – Op De Cirkel Sep 15 '12 at 06:49
  • This is in-fact waiting until the first thread finishes to even enter the `synchronized` block, let alone to `wait`... – obataku Sep 15 '12 at 06:57
  • 2
    Not really, `wait()` allows another thread to enter, but that one will go to sleep on `wait()` too, then next one enters, and so on, until 100ms pass, then one of the _waiters_ will wake up and go on ... so strange ... who know what the guy had smoked when wrote that code – Op De Cirkel Sep 15 '12 at 07:00
  • One might *guess* that the implementer thought that the synchronized set was calling notify - thereby waiting for something to change on the set. Except it doesn't work that way... – Richard Sitze Sep 15 '12 at 07:09
  • @OpDeCirkel you can't `wait` until the lock is acquired in the first place... that was my point. – obataku Sep 15 '12 at 07:15

2 Answers2

2

If there are no notify or notifyAll calls this will essentially act like a 100ms sleep. After 100ms the thread will wake back up and continue. Assuming there is nothing but the wait within the synchronized block then this is just an odd way of sleeping. There are likely some other impacts caused by the synchronized block triggering a memory barrier. So it may be there is some subtle thread safety thing going on.

One thing to be careful of in interpreting thread dumps here are are the 400 threads blocked waiting to get into the synchronized block or waiting on the wait? When a thread enters the wait it essentially releases the lock from the synchronized block so another thread can enter it. The lock will be re-aquired as the thread wakes up from the wait.

If the thread dumps says something like "waiting for monitor entry" then one thread is in the synchronized block and all the others are trying to enter. This would suggest you have a major concurrency problem here.

If however the thread dump shows something like "in Object.wait()" then it means the thread is in that 100ms wait and other threads are free to enter the synchronized block. In this case that would mean that the loop condition is still false, so look at what is going on on that side, not what's going on with the waiting threads.

That said, if it is possible to execute a notify / notifyAll when whatever the other side of the process is it's going to reduce the lag and cost of waking up and checking that the threads should still be sleeping.

EdC
  • 2,309
  • 1
  • 17
  • 30
2

There must be some place in the code where a thread adds jspVfsName to m_processingFiles, does some more work, and then removes jspVfsName from m_processingFiles. If this wasn't the case, then your other threads would simply wait() forever in that while loop. For some reason, the implementer didn't want any other threads to execute updateJsp while that other processing is going on.

I recommend you examine the code to see what jspVfsName actually is, and find where in the code it could be added/removed from m_processingFiles. Perhaps then you will also understand why the author didn't want updateJsp to run while jspVfsName is in m_processingFiles.

Once you find that, you can examine that "other" code to see if jspVfsName could ever be added to m_processingFiles and never removed. If so, that would (naturally) cause livelock, which would explain your stability problems.

Or could it be that updateJsp is being called very frequently, and the "other" code which modifies m_processingFiles is also being called very frequently, to the point that it causes a major concurrency bottleneck? Could there be something wrong with your app which is causing updateJsp to be called more often than it should (perhaps on every request, rather than every time new JSP files are placed on the server)?

If updateJsp is running very frequently, but it's not caused by a problem with your app, you could try simply shorten the wait() period. That shouldn't hurt anything -- it will just make the waiting threads check whether jspVfsName is still in m_processingFiles more frequently. 100ms is a long, long time in CPU terms!

Alex D
  • 29,755
  • 7
  • 80
  • 126
  • That is fine, i understand that the while loop acts like a barrier. But the way is implemented, makes the threads (those that entered the loop, and in my case most of the threads i have) wait 100ms regardless of the condition and they wait serially . The actual solution would be to put manipulation of `processingFiles` in the loop and use `wait/notify` or something similar that does not cause so erratic scalability problems – Op De Cirkel Sep 15 '12 at 07:22
  • 1
    That's what you *think*, but as a wise programmer once said, "all problems are caused by thinking". (Because after the problem is solved, you will always say "but I thought...") To really know what is going on, you need to find out what `jspVfsName` is, where in the code it is added/removed from `m_processingFiles`, and what "other" processing is happening *while* it is in `m_processingFiles`. Until you find that out, you're just guessing what the problem is. – Alex D Sep 15 '12 at 07:28
  • If you discover that the problem really is what you think it is, changing the `wait()` to 1ms shouldn't hurt. Actually, even 1ms is a *long* time in CPU terms. If each of your 400 threads wakes up every 1ms and checks `m_processingFiles`, the overhead should still be not too high. – Alex D Sep 15 '12 at 07:29
  • 1
    Let me point out just one more thing: if 400 threads all execute a `wait()` on the same monitor, that does *not* mean they will wait serially. They will all wait until 100ms has expired, and then they will *wake up* serially (because each thread has to acquire the monitor to come out of `wait()`). But waking up and coming out of the `synchronized` block should be fast, so all of them should be out of the block in a *little* more than 100ms. – Alex D Sep 15 '12 at 07:32
  • The facti is: .... which is more than the configured time (StuckThreadMaxTime) of "***600" seconds***. Stack trace:Thread-470 "[STUCK] ExecuteThread: '402' for queue: 'weblogic.kernel.Default (self-tuning)'" { -- ***Waiting for notification*** on: java.util.Collections$SynchronizedSet@1822b918[fat lock] org.opencms.loader.CmsJspLoader.updateJsp(CmsJspLoader.java:595) – Op De Cirkel Sep 15 '12 at 07:48
  • Threads usually do not wake up on `wait()` (except _spurious wakeup_) and they eventually wake up on `wait(ms)` but that is not guranteed. `notify()` is must for normal operation. – Op De Cirkel Sep 15 '12 at 07:52
  • >>> _coming out of the synchronized block should be fast_: This lock (monitor) has huge contention, and the JVM transforms contended locks to _fat_ locks, which means, (the OS) will suspend the threads that try to acquire it if already taken. It can take ages a thread to pass through. – Op De Cirkel Nov 13 '13 at 05:03