We are using jgroups-3.0.3.Final as a cluster-wide locking implementation in a cluster of two nodes. Our (simplified) JGroups configuration is as follows:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
    <TCP bind_port="7800" .../>
    <TCPPING .../>
    <MERGE2  min_interval="10000" max_interval="30000"/>
    <FD_SOCK/>
    <FD timeout="3000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    ...
    <PEER_LOCK/> 
</config>

We perform lock/unlock as follows:

Lock lock = getLockService().getLock("mylock");
try
{
   lock.tryLock();
   //do something
}
finally
{
   lock.unlock();
}

We are experiencing false failure detection several times a day, probably because the FD timeout value is too low. What is worse, locks that were being obtained during such a false FD often hang forever.

The scenario is this:

  1. We have a cluster view of {A,B|1}.
  2. A failure is detected although both nodes are alive (false FD).
  3. Node A suspects Node B and creates a new view, {A|2}.
  4. The suspected Node B is still in view {A,B|1}.
  5. Node B tries to obtain the lock "mylock".
  6. Node A discards the grant-lock messages from Node B, because B is in a different view.
  7. A view merge is performed and a new view is created: {A,B|3}.

Problem: the thread that tries to get "mylock" hangs at the lock.tryLock(); line, and each subsequent attempt to get "mylock" fails as well.

We switched to tryLock(long time, TimeUnit unit) with a timeout specified, and that seems to have solved the problem.
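
For reference, here is a minimal sketch of the timed variant. It assumes only that the lock is a java.util.concurrent.locks.Lock; a ReentrantLock stands in for the JGroups lock so the snippet runs on its own. The class name TimedLockExample and the helper runWithLock are hypothetical names used for illustration. Unlike the snippet above, it checks the boolean returned by tryLock and calls unlock() only when the lock was actually acquired.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class TimedLockExample {

    // Runs the action only if the lock is acquired within the timeout.
    // Returns true if the action ran, false if the lock was not obtained
    // (timeout elapsed or the waiting thread was interrupted).
    static boolean runWithLock(Lock lock, long timeout, TimeUnit unit, Runnable action) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            return false; // not acquired: unlock() must not be called
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock(); // only reached while the lock is held
        }
    }

    public static void main(String[] args) {
        // Stand-in for getLockService().getLock("mylock")
        Lock lock = new ReentrantLock();
        boolean ran = runWithLock(lock, 5, TimeUnit.SECONDS,
                () -> System.out.println("doing work under the lock"));
        System.out.println("action ran: " + ran);
    }
}
```

The 5-second timeout is an arbitrary illustrative value; a caller that gets false back can retry or report the failure instead of blocking indefinitely.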

Question: Does this mean that the JGroups implementation of Lock.tryLock() without a timeout has a bug and should be avoided?

Thanks.

zxeka

1 Answer

In addition to increasing the FD timeout and using tryLock() with a timeout, it is better to change PEER_LOCK to CENTRAL_LOCK. See the details here: https://community.jboss.org/message/827520
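
As a rough sketch, the relevant part of the configuration from the question would then look something like this. The FD timeout of 10000 is only an illustrative value, not a tested recommendation, and the elided parts of the stack stay as they were:

    <FD timeout="10000" max_tries="3" />
    <VERIFY_SUSPECT timeout="1500" />
    ...
    <CENTRAL_LOCK/>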

zxeka