We are using jgroups-3.0.3.Final as a cluster-wide locking implementation in a cluster of two nodes. Our JGroups configuration (simplified) is as follows:
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
<TCP bind_port="7800" .../>
<TCPPING .../>
<MERGE2 min_interval="10000" max_interval="30000"/>
<FD_SOCK/>
<FD timeout="3000" max_tries="3" />
<VERIFY_SUSPECT timeout="1500" />
...
<PEER_LOCK/>
</config>
We perform lock/unlock as follows:
Lock lock = getLockService().getLock("mylock");
try
{
    lock.tryLock();
    //do something
}
finally
{
    lock.unlock();
}
We are experiencing false failure detection several times a day, probably because the FD timeout value is too low. What is worse, we often have several locks hanging forever if they were obtained during such a false failure detection.
The scenario is this:
- We have a cluster view of {A,B|1}
- A failure is detected although both nodes are alive (false FD).
- Node A suspects Node B and creates a new view, {A|2}.
- The suspected Node B is still in view {A,B|1}.
- Node B tries to obtain the lock "mylock".
- Node A discards the grant-lock messages from Node B, as Node B is in a different view.
- A view merge is performed and a new view is created: {A,B|3}.
Problem: the thread which tries to get "mylock" hangs at the lock.tryLock(); line, and each subsequent attempt to get "mylock" fails as well.
We then switched to tryLock(long time, TimeUnit unit) with a timeout specified, and it seems to have solved the problem.
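Roughly, the workaround looks like the sketch below. It is simplified and not our production code: a ReentrantLock stands in for the lock returned by getLockService().getLock("mylock") (both are java.util.concurrent.locks.Lock), so it runs without a cluster. The helper name runWithLock and the 5-second timeout are chosen only for illustration. Unlike our original snippet, it checks the result of tryLock and only calls unlock() if the lock was actually acquired.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class TimedLockSketch {

    // Hypothetical helper: runs the work only if the lock is acquired within the timeout.
    // Returns true if the work ran, false if the lock could not be obtained in time
    // or the waiting thread was interrupted.
    static boolean runWithLock(Lock lock, long timeout, TimeUnit unit, Runnable work) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for callers
            return false;
        }
        if (!acquired) {
            return false; // give up instead of waiting forever
        }
        try {
            work.run();
            return true;
        } finally {
            lock.unlock(); // only reached when the lock is held
        }
    }

    public static void main(String[] args) {
        // Assumption: in the real application this would be getLockService().getLock("mylock").
        Lock lock = new ReentrantLock();
        boolean done = runWithLock(lock, 5, TimeUnit.SECONDS,
                () -> System.out.println("do something"));
        System.out.println("acquired=" + done);
    }
}
```

With this shape, a grant response that is lost during the view change costs at most one timeout period instead of blocking the calling thread indefinitely.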
Question: Does this mean that the JGroups implementation of Lock.tryLock() without a timeout has a bug and should be avoided?
Thanks.