
I am experiencing weird behavior from File::NFSLock in Perl v5.16. I am using a stale lock timeout of 5 minutes. Say I have three processes. Process 1 took more than 5 minutes before releasing its lock, and process 2 then got the lock. However, even though process 2 has held the lock for less than 5 minutes, process 3 comes along and removes the lock file, causing process 2 to fail when it later tries to remove the NFSLock it holds itself.

My theory is that process 3 wrongly read the last-modified time of the lock as the one written by process 1, not by process 2. The lock files are written to partitions mounted over NFS.
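One way to test this theory is to have process 3 log the lock file's modification time just before it decides the lock is stale. A minimal diagnostic sketch, assuming the module's default lock extension of ".NFSLock" (so the lock file sits next to the locked file) and a hypothetical path:

use strict;
use warnings;

my $file      = '/nfs/share/some.file';   # hypothetical path, same $file passed to File::NFSLock
my $lock_file = $file . '.NFSLock';       # assumes the default lock extension

# stat()[9] is the mtime; on NFS it comes from the server, so client clock
# skew or attribute caching can make the lock look older (or newer) than it is.
my $mtime = (stat $lock_file)[9];
if (defined $mtime) {
    printf "lock file %s is %d seconds old (stale threshold: %d)\n",
        $lock_file, time() - $mtime, 5 * 60;
} else {
    print "no lock file found at $lock_file\n";
}

If the reported age corresponds to when process 1 created its lock rather than when process 2 re-created it, that would support the theory above.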

Does anyone have an idea, or has anyone faced a similar issue with Perl's File::NFSLock? Please refer to the snippet below.

use Fcntl qw(LOCK_EX);
use File::NFSLock;

my $lock = new File::NFSLock {file      => $file,
                              lock_type => LOCK_EX,
                              blocking_timeout   => 50,     # wait up to 50 sec for the lock
                              stale_lock_timeout => 5 * 60};# treat locks older than 5 min as stale

$DB::single = 1;    # debugger breakpoint; process 1 is held here for > 5 min
if ($lock) {
    $lock->unlock();
}

If I hold process 1 at the debugger breakpoint for more than 5 minutes, I observe this behavior.
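The behavior can also be reproduced without the debugger by letting the lock holder sleep past the stale threshold. A sketch of such a reproducer (the file path is a placeholder; start two or three copies a few minutes apart):

use strict;
use warnings;
use Fcntl qw(LOCK_EX);
use File::NFSLock;

my $file = '/nfs/share/some.file';       # hypothetical NFS-mounted path

my $lock = File::NFSLock->new({
    file               => $file,
    lock_type          => LOCK_EX,
    blocking_timeout   => 50,            # wait up to 50 sec for the lock
    stale_lock_timeout => 5 * 60,        # treat locks older than 5 min as stale
}) or die "could not obtain lock on $file: $File::NFSLock::errstr";

print "$$ acquired lock\n";
sleep 6 * 60;                            # hold the lock longer than stale_lock_timeout
$lock->unlock();
print "$$ released lock\n";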

Deepanshu Arora

1 Answer


From reviewing the code at
https://metacpan.org/pod/File::NFSLock
I see that the lock is implemented simply as a physical file on the filesystem.
I use the same process-lock logic in almost every project.

With a process lock it is crucial not to set the stale_lock_timeout too tight,
or a race condition will occur, as is also mentioned in the module's in-code comments.
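To make the race concrete: a waiting process that finds a lock older than stale_lock_timeout simply removes the lock file and creates its own; it does not re-check who currently holds it. A simplified illustration of that decision (not the module's actual code, just the shape of the logic; the path is a placeholder):

use strict;
use warnings;

my $file               = '/nfs/share/some.file';   # placeholder path
my $lock_file          = "$file.NFSLock";          # assumes the default lock extension
my $stale_lock_timeout = 5 * 60;

my $mtime = (stat $lock_file)[9];
if (defined $mtime && time() - $mtime > $stale_lock_timeout) {
    # The age alone says "stale", even though the holder may still be working.
    unlink $lock_file;            # the old lock is thrown away ...
    # ... and the waiting process goes on to create its own lock file here.
}

This is why a stale_lock_timeout that is shorter than the real job duration turns every scheduled run into a lock thief.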

As you mentioned, the three processes start to compete over the same lock because the job takes more than 5 minutes and you set stale_lock_timeout to 5 minutes.
If you have a fixed scheduler such as the crond service launching a process every 5 minutes, each new process will consider the existing lock outdated, because 5 minutes have already passed even though the running job takes longer than 5 minutes.

To describe a possible scenario:
Some DB job takes 4 minutes to complete, but on a congested system it can take 7 minutes or more.
Now suppose the crond service launches a process every 5 minutes.
At 0 min, the first process, process1, finds the job as new, sets the lock and starts the job, which will take up to 7 minutes.
At 5 min, crond launches process2, which finds the lock of process1 but decides it is already stale, because 5 minutes have passed since the lock was created. So process2 removes that lock and acquires the lock for itself.
At 7 min, process1 has finished the job and, without checking whether the lock is still its own, releases the lock of process2 and exits.
At 10 min, process3 is launched and finds no lock at all, because the lock of process2 was already released by process1, so it sets its own lock.
This scenario is really problematic because it leads to process and workload accumulation and unpredictable results.

The suggestions to fix this issue are:

  1. Set stale_lock_timeout to a value far bigger than the job would ever take (for example 10 or 15 minutes). The stale_lock_timeout must be bigger than the execution schedule interval (see the sketch after this list).
  2. Make the execution schedule more spacious, to give each process enough time to finish its task (every 10 or 15 minutes).
  3. Consider integrating the jobs of process1, process2 and process3 into a single process_master which launches each process only after the former ones have finished.
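As a sketch of the first suggestion (the 15-minute value and the file path are placeholders, not measured figures), the lock would be acquired like this:

use strict;
use warnings;
use Fcntl qw(LOCK_EX);
use File::NFSLock;

my $file = '/nfs/share/some.file';       # placeholder path

# stale_lock_timeout chosen well above the worst-case job duration (7 min in
# the scenario above); the cron schedule should be widened to match.
my $lock = File::NFSLock->new({
    file               => $file,
    lock_type          => LOCK_EX,
    blocking_timeout   => 50,
    stale_lock_timeout => 15 * 60,
}) or die "could not obtain lock on $file: $File::NFSLock::errstr";

# ... run the job ...

$lock->unlock();

For the third suggestion, the process_master would simply run the three jobs one after another (for example via system() calls), so that only one of them ever touches the lock at a time.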
  • Re "*At 7 min, process1 has finished the job and, without checking whether the lock is still its own, releases the lock of process2 and exits.*", That sounds like a bug to me! – ikegami Feb 26 '20 at 15:58
  • Re "*This scenario is actually really problematic because it leads to a process accumulation*", Yeah, the stale job should get killed – ikegami Feb 26 '20 at 16:00
  • In my case, you can assume process3 runs different code and can only remove the lock. process2 waited for process1 to release the lock and started executing, and process3 triggered at that exact moment and removed the lock which belonged to process2 – Deepanshu Arora Feb 27 '20 at 06:01
  • @ikegami In this scenario the job is not actually stale; it just takes longer than expected in some circumstances. The `stale_timeout` is **too small** and does not correspond to the real job execution time window – Bodo Hugo Barwich Feb 27 '20 at 17:59
  • @DeepanshuArora The system as you describe it, where `process3` can only remove the lock and requires `process1` and `process2` to be finished, calls for the `process_master` architecture which launches `process1`, `process2` and finally `process3`. But anyway, did you already try setting the `stale_timeout` **bigger** as I suggested? – Bodo Hugo Barwich Feb 27 '20 at 18:06
  • @DeepanshuArora Continuing to analyse the described system: does it make sense to execute `process3` after a failed `process2` if it relies on `process1` and `process2` having succeeded? Again, the `process_master` architecture might make the system work. – Bodo Hugo Barwich Feb 27 '20 at 18:12
  • @Bodo Hugo Barwich, Re "*In this scenario the job is not actually stale*", I said stale, not hung. The job is stale if it's still running after `stale_timeout` seconds, by definition. – ikegami Feb 27 '20 at 18:15
  • @Bodo Hugo Barwich, Re "*the `stale_timeout` is too small*", That's irrelevant to what I was saying. Even if you make `stale_timeout` larger, you should still kill the existing job if you start another instance of it! – ikegami Feb 27 '20 at 18:19
  • @ikegami An indiscriminate `kill process2` can only be advised if all processes are instances of the same script producing the same fully reproducible result (like some "*is-alive*" check). But in the use case where they are part of an **incremental**, **progressive** job (like a database backup or huge file download), `kill process2` will result in a **faulty and corrupt result** for the **whole job**. Therefore it is not my first answer. – Bodo Hugo Barwich Feb 27 '20 at 21:13