30

We hit an extremely surprising exception today. Inside of a synchronized block, we call wait() and it throws IllegalMonitorStateException. What can cause this?

This is happening in well-tested open source code: http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/Mux.java?view=markup#l222

We eliminated the obvious causes:

  • are we synchronized on the right variable? Yes, it's muxLock
  • is it a mutable variable? No, muxLock is final
  • are we using any weird "-XX:" JVM flags that might affect monitor behavior? No, but we are launching the JVM embedded inside a C++ app via JNI.
  • is this a strange JVM? No, it's Sun's 1.6.0_25 win/x64 JRE
  • is this a known JVM bug? Can't find anything relevant at http://bugs.sun.com/bugdatabase

So, I'm trying to think of more far-fetched explanations.

  • could an uncaught out-of-memory error cause the monitor state to be screwed up? We're looking at this, but we're seeing no evidence of memory errors yet.

UPDATE: (based on comment)

I've also verified from the stacktrace and breakpoint that the thread is indeed inside the synchronized block when the exception is thrown. It's not the case that some other unrelated code is emitting the exception (unless something is REALLY confusing Eclipse!)

Chris Aldrich
  • 1,904
  • 1
  • 22
  • 37
Chris Dolan
  • 8,905
  • 2
  • 35
  • 73
  • 2
    Is it possible to edit the source and/or get at the underlying cause of the exception? See if it was thrown by a different piece of code, etc? – Chris Aldrich Sep 21 '11 at 16:27
  • @ChrisAldrich - Good question. No, we caught the stack trace one of the times it happened. See this screen capture of Eclipse sitting on an uncaught exception breakpoint (sorry, should have grabbed the trace as as text...) http://i.imgur.com/UPPQS.png – Chris Dolan Sep 21 '11 at 16:32
  • 2
    It does look like a bug, Java 6 update 26 came soon after update 25. I would try that. – Peter Lawrey Sep 21 '11 at 16:43
  • @PeterLawrey - Good idea. I looked at the bug lists in the 1.6.0_26 and _27 release notes and nothing jumps out. We're testing older JREs just in case. http://www.oracle.com/technetwork/java/javase/2col/6u27bugfixes-444150.html and http://www.oracle.com/technetwork/java/javase/6u26releasenotes-401875.html – Chris Dolan Sep 21 '11 at 16:53
  • 1
    Is it possible to run the java code independently of JNI to check that the C++ program hasn't corrupted the JVM by accident? – Dunes Sep 21 '11 at 17:57
  • Does this problem occur regularly or was it a one-time event? Is it possible that at the moment of occurance the class in question was being reloaded or unloaded while the loop still ran? (automatic redeploy in a webcontainer maybe?) – rsp Sep 21 '11 at 18:39
  • @rsp - repeatable. How could a class be unloaded while it's executing? – Chris Dolan Sep 21 '11 at 20:19
  • @Chris, if the classloader which loaded the class is destroyed for instance (do you use a custom classloader to load the Mux classes?). I have seen strange behaviour during shutdown of application servers sometimes that looked a bit like this behaviour. The solution was to make sure all background processing was stopped before the actual shutdown happens. – rsp Sep 22 '11 at 08:35
  • @rsp - thanks for the info. Apache River uses a lot of custom classloaders for client code, but this particular class is running in the system classloader. Besides, I believe that classloaders are collected rather than destroyed, so it should not be possible to unload one while one it's classes is still executing. – Chris Dolan Sep 22 '11 at 18:05
  • Can you post the entire exception somewhere @Chris? – Gray Sep 22 '11 at 19:10
  • 1
    @Gray - what I was given by the team is this screenshot: http://i.imgur.com/UPPQS.png – Chris Dolan Sep 22 '11 at 19:15
  • 4
    I saw that. I was hoping for the entire exception showing the stack trace. Two things I notice. In your eclipse screenshot it lists the line number as 231 but the source shows 222. Some sort of source/class mismatch? Also `muxLock` usage looks fine in that class but it has package permissions. Any chance this exception is elsewhere in another class and the debugging is tricking you to think it is happening there? Next time in the debugger I'd dump the stack trace from the exception directly. – Gray Sep 22 '11 at 19:41
  • @Gray - sorry for the misunderstanding. Line 231 is correct - that's where the muxLock.wait() is invoked. The line numbers don't quite match up with SVN because we're running a slightly older version. But that reminds me that I should check what's changed... The other-class trickery idea concerns me. It's true that we're relying heavily on the evidence that the debugger is showing us. I'll have the team add some debug print code and try again. Thanks! – Chris Dolan Sep 22 '11 at 19:46
  • If you are running a slightly older version, perhaps there is a bug in the old version. Another possible warning is that Mux leaks a reference to `this` during object construction. This makes it possible for those classes to access the methods with the lock being null. I cannot see how you could get your error though. – Alexander Torstling Sep 23 '11 at 09:26
  • @rsp For `static final` fields? Tomcat very naughtily nulls these out when discarding a `ClassLoader`. Admittedly it does this because of long-standing bugs in the Java Library implementation. – Tom Hawtin - tackline Sep 24 '11 at 12:24
  • Have you tried compiling river from source with the latest version of Java? Also, in `jini-ext.jar` what is the information in the `META-INF/MANIFEST.MF` ? – beny23 Sep 27 '11 at 13:41
  • @ChrisAldrich: one other idea is that the object referenced by the final field can be changed on object serialization/deserialization. I didn't see anything like that in the bit of code that I looked at but maybe worth a look. Otherwise, like others, I'd say your best path forward is to look closely at the C++ used in the JNI and be sure it isn't doing something to affect the muxLock field state. – philwb Sep 29 '11 at 14:18
  • @philwb I think you mean Chris Dolan. :) Its his question. I'm just contributing the bounty. – Chris Aldrich Sep 29 '11 at 16:05
  • Did you try logging lock instance (System.identityHashCode) just before synchronized on lock and just before entering lock.wait()? This can shows you (may be) does the lock instance actually different. – BegemoT Sep 30 '11 at 09:40

4 Answers4

6

The only suspicious thing I see that you are passing a reference to 'this' to some other object in your constructor. Is it possible (in fact, not unlikely) that, through weird re-ordering of things, if some other thread gets that reference to 'this' and calls the method that uses the muxlock, things can go extremely wrong.

The Java Language Specification is pretty specific about this:

An object is considered to be completely initialized when its constructor finishes. A thread that can only see a reference to an object after that object has been completely initialized is guaranteed to see the correctly initialized values for that object's final fields.

In other words, if another thread gets hold of the 'this' reference before the constructor is finished, the final field 'muxlock' might not be correctly initialized yet. In general, publishing a reference to 'this' before the constructor has finished can be pretty dangerous, especially in threaded situations.

Some potentially useful discussion about such things: http://madpropellerhead.com/random/20100328-java-final-fields-are-not-as-final-as-you-may-think

For some older, but still useful general discussion of why publishing 'this' in a constructor is a very bad idea in general, see for instance: http://www.ibm.com/developerworks/java/library/j-jtp0618/index.html

Jan de Vos
  • 3,778
  • 1
  • 20
  • 16
  • 1
    @Jan de Vos: I completely agree with the dangers of leaking the this reference from a constructor but I don't see how it could explain what is seen here. If the reference leaked and was accessed prior to the set of muxLock then there should be a NPE at the synchronized statement. Instead, he gets past the statement and all the way to the object wait, which should mean that, leaked reference or not, muxLock was already set. As far as I know, final fields can be seen with either an uninitialized value (null for an object) or with the initialized value; no way I can see for it to have changed. – philwb Sep 29 '11 at 14:13
  • This is not the right answer for my actual problem (it's probably caused on the C++ side), but I'm going to accept it because it's probably the right answer for any future visitors. Thanks! – Chris Dolan Sep 30 '11 at 13:02
  • Conferred with Chris Dolan about the bounty and we decided to award it to your answer. – Chris Aldrich Sep 30 '11 at 13:18
2

http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/Mux.java?r1=1069292&r2=1135026&diff_format=h

here i can see that timeout was added lately

make sure that startTimeout is > than 0 otherwise you will wait(0) or wait(-n) this probably cause IllegalMonitorStateException

EDIT: Ok above is a disaster But lets try this :

we are in Mux constructor : http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/Mux.java?view=markup

line 176 we create SocketChannelConnectionIO andd pass this after that we break and and different thread takes over .

in constructor of SocketChannelConnectionIO defined here : http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/SocketChannelConnectionIO.java?view=markup line 112 we register to channel with the new handler().

handler recieaves something on chanel and function let say function handleReadReady is executed we synchronize on muxLock .

now we are still in constructor so object in final is still mutable !!! let assume it changes , now we have something waiting on different muxLock

One in a million scenario

EDIT

http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/Mux.java?revision=1135026&view=co

Mux(SocketChannel channel,
    int role, int initialInboundRation, int maxFragmentSize)
    throws IOException
    {
    this.role = role;
    if ((initialInboundRation & ~0x00FFFF00) != 0) {
        throw new IllegalArgumentException(
        "illegal initial inbound ration: " +
        toHexString(initialInboundRation));
    }
    this.initialInboundRation = initialInboundRation;
    this.maxFragmentSize = maxFragmentSize;

    //LINE BELOW IS CAUSING PROBLEM it passes this to SocketChannelConnectionIO
    this.connectionIO = new SocketChannelConnectionIO(this, channel);

    //Lets assume it stops here we are still in constructor
    //and we are not in synchronized block

    directBuffersUseful = true;
    }

now in constructor of SocketChannelConnectionIO http://svn.apache.org/viewvc/river/jtsk/trunk/src/com/sun/jini/jeri/internal/mux/SocketChannelConnectionIO.java?revision=1069292&view=co

SocketChannelConnectionIO(Mux mux, SocketChannel channel)
    throws IOException
{
    super(mux);
    channel.configureBlocking(false);
    this.channel = channel;
    //Line below we are registering to the channel with mux that is still mutable
    //this is the line that actually is causing the problem move that to 
    // start() and it should work 
    key = selectionManager.register(channel, new Handler());
}

move this code to start() should work key = selectionManager.register(channel, new Handler()); (i am assuming start is executet when we want to start prosessing)

/**
 * Starts processing connection data.
 */
void start() throws IOException {
    key = selectionManager.register(channel, new Handler());
    key.renewInterestMask(SelectionKey.OP_READ);
}

But it would be much better not to create SocketChannelConnectionIO in the constructor of mux but maybe somewhere after that the same for second constructor creating StreamConnectionIO with this

firegnom
  • 833
  • 7
  • 20
1

The answer is in my opinion that its either a bug, or someone changed the object behind the reference despite its being final. If you can reproduce it, I recommend to set a read/write breakpoint on muxlock field to see if it is touched or not. You could check the identityhashcode of the muxlock in the first line of the synchronized block, and before waits and notifies with appropiate log entries or breakpoints. With reflection you can change final references. Quote from http://download.oracle.com/javase/6/docs/api/java/lang/reflect/Field.html:

"If the underlying field is final, the method throws an IllegalAccessException unless setAccessible(true) has succeeded for this field and this field is non-static. Setting a final field in this way is meaningful only during deserialization or reconstruction of instances of classes with blank final fields, before they are made available for access by other parts of a program. Use in any other context may have unpredictable effects, including cases in which other parts of a program continue to use the original value of this field."

Maybe its a bug in eclispe, and during debugging it somehow changes the field. Is it reproducable outside eclispe as well? Put a printstractrace in catch and see what happens.

Gábor Lipták
  • 9,646
  • 2
  • 59
  • 113
  • 1
    There's no reflection or weaving going on in that code, and it's not serializable. I'm confident that the lock is not being intentionally altered by any code. – Chris Dolan Sep 24 '11 at 03:30
  • @Gabor Liptak - Did you take a look at the link posted above for the source code? I'd second what Chris Dolan says. – Chris Aldrich Sep 26 '11 at 12:45
  • @ChrisAldrich Yes I checked the sourcecode including the connectionmanager and its embedded clas outboundmux. I looked for muxlock changes. I cannot understand your sentence "I'd second what Chris Dolan says.". What does it mean? – Gábor Lipták Sep 26 '11 at 13:34
  • Just means I agree with his comment. – Chris Aldrich Sep 26 '11 at 15:12
  • @ChrisAldrich Ok, allright. On the other hand I still would give a try to the things I said to be 100 percent sure. Almost everytime I thought I have found a bug, it turned out that I did the mistake. – Gábor Lipták Sep 26 '11 at 20:02
  • An update: we're experimenting with -XX:-UseBiasedLocking and -XX:-UseCompressedOops to see if those make a difference. If not, then I strongly suspect the problem must be on the C++ side of the JNI boundary because we've ruled out pretty much all of the Java-side possibilities. – Chris Dolan Sep 27 '11 at 13:24
  • @ChrisDolan let us know on the result. It is a conundrum. – MarianP Sep 28 '11 at 08:49
  • @ChrisDolan What about the use of reflection Proxies? It is hard to follow the logic for me, but from what I gather out of source code of BasicInvocationHandler.invokeRemoteMethod and ConnectionManager.connect, the mux object containing the muxLock might be not a real Mux, but a java.lang.reflect.Proxy to a Mux object. There might be several issues with this then, e.g. Object.wait/notify/notifyAll are directly passed to proxied object, or the use of 'final Mux mux' object in some classes might not be making the mux final, just its proxy instead and making mux mutable or volatile then? – MarianP Sep 30 '11 at 11:18
  • @MarianP This would be obvious, if he would try the write breakpoint. It could be seen if it is a proxy or not. – Gábor Lipták Sep 30 '11 at 12:17
  • I don't think it would be obvious. If you suspend a thread inside a method of the proxied class, you won't see if it is proxied unless you explicitly test 'this' variable. Looking at the source code and the stacktrace picture, the Connection#OutboundMux where the wait is called IS a Proxy in my opinion. I am not able to say where exactly the problem lies, but my intuition tells me the problem is badly written code for communicating with remote stuff. Distributed frameworks often make programming look like programming inside a single JVM/endpoint and problems like this stand out then. – MarianP Sep 30 '11 at 13:14
  • @MarianP - the Mux is a concrete object that dispatches to runtime-generated proxies. The Mux is part of the communication layer in Jini that is concealed from the end programmer by the proxies of the domain interfaces. The River code is indeed hard to follow (it took me years to really understand what was going on) but it's pretty remarkable. – Chris Dolan Sep 30 '11 at 19:37
1

Member variables are not as final as one would hope to. You should put the synchronized object into a final local variable, first. This does not explain why the member variable is altered, but if it fixes the problem you at least know that the member variable is really modified.

Christian
  • 13,285
  • 2
  • 32
  • 49