
We have a Java process that listens to a directory X on the file system using Apache Commons VFS. Whenever a new file is exported to this directory, our process kicks in. We first rename the file to filename.processing, parse the file name, get some information from the file and insert it into tables, then send the file to a document management system. The application is single-threaded per node. Now consider this running in a cluster environment: we have 5 servers, so 5 different VMs are trying to get access to the same file. The whole implementation was built on the assumption that only one process can rename the file to .processing at a given time, as the OS will not allow multiple processes to modify the file simultaneously. Once one node gets hold of the file and renames it to .processing, the other nodes ignore files with the .processing suffix.

This had been working fine for more than a year, but we just found a few duplicates. It looks like multiple nodes got hold of the same file: say nodes a, b and c all got access to f.pdf and renamed it to f.pdf.processing at the same time (I am still baffled how the OS allows the file to be modified concurrently). As a result, nodes a, b and c each processed the file and sent it to the document management system, so there are now 3 duplicate files.

So in short, what I am looking for is an approach to run a task only once in a cluster environment. I also want a failover mechanism, so that if something goes wrong on one node, another node picks up the task. We don't want to set an environment variable like master=true on one box, as that would limit the work to a single node and would not handle failover.

Any kind of help is appreciated.

Maverick Riz
  • Is there a tl;dr version? – Jean Logeart Dec 11 '15 at 22:06
  • Does your application process many small files, or few large ones? I mean is it possible to wait for some amount of time before processing started or will it cause performance issues? – user3707125 Dec 11 '15 at 22:06
  • This seems like a very error prone approach even if done correctly. You should look at Amazon SQS to see how they manage tasks and either use that or implement something similar. – nikdeapen Dec 11 '15 at 23:25
  • @JeanLogeart Sorry, I did not get you. Which version are you referring to? – Maverick Riz Dec 13 '15 at 22:23
  • @user3707125 Many small files. We thought about waiting for some time, e.g. setting a random number of seconds as the sleep time, so each node would get a different delay. That would definitely lower the chances of running into the issue, but the race condition is still possible. – Maverick Riz Dec 13 '15 at 22:25
  • @nikdeapen Thank you for the advice. Unfortunately our management does not want to use Amazon SQS or re-implement the whole process at this time. – Maverick Riz Dec 13 '15 at 22:26

3 Answers


See the following post about file locking: How do filesystems handle concurrent read/write?

Read and write operations on files (and that includes renaming) are not atomic and are not synchronized between processes in the way you assumed, at least not on most operating systems.

However, creating a new file usually is an atomic operation. You can use that to your advantage; the concept is called whole-file locking.
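
A minimal sketch of that idea (the class name, the `tryAcquire` helper and the `.lock` suffix are my own illustration, not from the question): each node attempts to atomically create a sidecar lock file before touching the real file, and only the node whose create succeeds processes it.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LockFileDemo {

    /**
     * Attempts to claim a file by atomically creating a sidecar lock file.
     * Files.createFile performs an atomic check-and-create and throws
     * FileAlreadyExistsException if the file already exists, so only one
     * process can succeed for a given lock path.
     */
    public static boolean tryAcquire(Path lockFile) throws IOException {
        try {
            Files.createFile(lockFile);
            return true;               // we own the lock, safe to process
        } catch (FileAlreadyExistsException e) {
            return false;              // another node claimed the file first
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("vfs-demo");
        Path lock = dir.resolve("f.pdf.lock");
        System.out.println(tryAcquire(lock)); // first claim succeeds: true
        System.out.println(tryAcquire(lock)); // second claim fails: false
    }
}
```

One caveat: the atomicity guarantee is reliable on local file systems, but on some network file systems (older NFS in particular) even file creation may not be atomic, so it is worth verifying on the actual shared storage.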

Andreas Vogl
  • Thank you, how can I acquire whole-file locking using Java code? In the other answer, Viacheslav has specified VFSUtils, which has an acquireLock method. I wonder if that will do the job? https://synapse.apache.org/apidocs/org/apache/synapse/transport/vfs/VFSUtils.html – Maverick Riz Dec 13 '15 at 22:43

We are implementing our own synchronization logic using a shared lock table inside the application database. This allows all cluster nodes to check whether a job is already running before actually starting it.
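
The answer does not show the schema, so here is a hedged sketch of how such a lock table might work (the table name `job_lock`, its columns, and the class below are my own illustration, not the asker's actual code). The key property is that a uniqueness constraint makes "insert the lock row" an atomic test-and-set; in the sketch an in-memory map stands in for the database table so the logic is runnable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch of the shared-lock-table idea. In the real system the "table"
 * lives in the application database, e.g.:
 *
 *   CREATE TABLE job_lock (job_name  VARCHAR(255) PRIMARY KEY,
 *                          locked_by VARCHAR(64),
 *                          locked_at TIMESTAMP);
 *
 * and acquiring the lock is a plain INSERT: the primary-key constraint
 * guarantees that exactly one node's INSERT succeeds and the rest fail.
 * Here putIfAbsent plays the role of the constrained INSERT.
 */
public class JobLockTable {
    private final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();

    /** Returns true iff this node inserted the row, i.e. won the lock. */
    public boolean tryLock(String jobName, String nodeId) {
        return table.putIfAbsent(jobName, nodeId) == null;
    }

    /** Deleting the row releases the lock. For failover, other nodes
     *  could also reap rows whose locked_at timestamp is stale. */
    public void unlock(String jobName, String nodeId) {
        table.remove(jobName, nodeId);
    }

    public static void main(String[] args) {
        JobLockTable locks = new JobLockTable();
        System.out.println(locks.tryLock("f.pdf", "node-a")); // true
        System.out.println(locks.tryLock("f.pdf", "node-b")); // false
        locks.unlock("f.pdf", "node-a");
        System.out.println(locks.tryLock("f.pdf", "node-b")); // true
    }
}
```

The stale-timestamp reaping mentioned in the comment is what gives this approach the failover property the question asks for: if a node dies while holding a lock, the other nodes can eventually take the row over.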

Maverick Riz

Did you try to use FileLock tryLock() or lock() before renaming the file to .processing? If you didn't, I think you should: that way only one application is allowed to change the file.
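
A minimal sketch of that suggestion (the class and file names are made up for illustration): open a FileChannel on the file and call tryLock(), which returns null immediately if another process already holds the lock, instead of blocking like lock() does.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileLockDemo {

    /**
     * Tries to take an exclusive OS-level lock on the channel's file.
     * Returns null if another process currently holds the lock.
     */
    public static FileLock tryExclusiveLock(FileChannel channel) throws IOException {
        return channel.tryLock(); // null means somebody else owns it
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("f", ".pdf");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            FileLock lock = tryExclusiveLock(channel);
            if (lock != null) {
                try {
                    // safe to rename to .processing and parse the file here
                } finally {
                    lock.release();
                }
            }
        }
    }
}
```

Note that FileLock is advisory on many platforms and its behaviour over network file systems varies, so it only helps if every node cooperates by taking the lock before touching the file.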

Update: Sorry, I forgot that you asked about VFS. For Apache Commons VFS (in fact, in Apache Synapse) I found the VFSUtils class, which has the following method:

public static boolean acquireLock(org.apache.commons.vfs2.FileSystemManager fsManager,
                                  org.apache.commons.vfs2.FileObject fo)

Acquires a file item lock before processing the item, guaranteeing that the file is not processed while it is being uploaded and/or the item is not processed by two listeners
Parameters:
   fsManager - used to resolve the processing file
   fo - representing the processing file item
Returns:
   boolean true if the lock has been acquired or false if not

I think that method can solve your problem (if you can use Apache Synapse in your project).

Slava Vedenin
  • How can one apply `FileLock` to Apache VFS? – user3707125 Dec 11 '15 at 22:16
  • @ViacheslavVedenin Thanks for looking into this. I will try your solution and let you know if it worked for me. I need to add the Synapse dependency for this; it's surprising that this functionality is not part of Apache VFS. Just for my understanding: if one cluster node acquires this lock, then when another node tries to acquire it, I am hoping this method will return false? – Maverick Riz Dec 13 '15 at 22:37