Lock.tryLock() in suspecting node hangs forever after false failure detection

Question

We are using jgroups-3.0.3.Final as a cluster-wide locking implementation in a cluster of two nodes. Our (simplified) JGroups configuration is as follows:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
    <TCP bind_port="7800" .../>
    <TCPPING .../>
    <MERGE2  min_interval="10000" max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    ...
    <PEER_LOCK/> 
</config>

We perform lock/unlock as follows:

Lock lock = getLockService().getLock("mylock");
try
{
   lock.tryLock();
   //do something
}
finally
{
   lock.unlock();
}

We are experiencing false failure detections several times a day, probably because the FD timeout is too low. What is worse, several locks often hang forever if they were being obtained during such a false FD.

The scenario is as follows:

We have a cluster view of {A,B|1}.
A failure is detected although both nodes are alive (false FD).
Node A suspects Node B and creates a new view, {A|2}.
The suspected Node B is still in view {A,B|1}.
Node B tries to obtain the lock "mylock".
Node A discards the grant lock messages from Node B, as Node B is in a different view.
A view merge is performed and a new view is created: {A,B|3}.

Problem: the thread that tries to get "mylock" hangs at the lock.tryLock() line, and each subsequent attempt to get "mylock" fails as well.

We then used tryLock(long time, TimeUnit unit) with a timeout specified, and it seems to have solved the problem.
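
A minimal sketch of the timed pattern, for illustration only. It assumes the lock is a plain java.util.concurrent.locks.Lock (which is what LockService.getLock() returns); the class name TimedLockExample, the helper runWithLock, and the 5-second timeout are hypothetical, and a ReentrantLock stands in for the cluster lock so the example runs without JGroups. Unlike the snippet above, it checks the boolean result and unlocks only when the lock was actually acquired.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockExample {

    /**
     * Runs the task only if the lock is acquired within the timeout.
     * Returns true if the task ran, false if the lock could not be acquired.
     * (Hypothetical helper, not part of JGroups.)
     */
    public static boolean runWithLock(Lock lock, long timeout, TimeUnit unit, Runnable task)
            throws InterruptedException {
        // Bounded wait: gives up and returns false instead of blocking forever
        if (!lock.tryLock(timeout, unit)) {
            return false;
        }
        try {
            task.run();
            return true;
        } finally {
            // Reached only on the path where the lock was actually acquired
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for getLockService().getLock("mylock")
        Lock lock = new ReentrantLock();
        boolean ran = runWithLock(lock, 5, TimeUnit.SECONDS,
                () -> System.out.println("do something"));
        System.out.println("task ran: " + ran);
    }
}
```

When the call returns false, the caller can retry or report the failure, which is presumably why the thread no longer hangs in our case.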

Question: Does this mean that the JGroups implementation of Lock.tryLock() without a timeout has a bug and should be avoided?

Thanks.

score 0 · Answer 1 · answered Jul 15 '13 at 09:17

In addition to increasing the timeout and using tryLock() with a timeout, it would be better to switch from PEER_LOCK to CENTRAL_LOCK. Please see details here: https://community.jboss.org/message/827520

zxeka
