10

My file processing scenario is:

 read input file -> process -> generate output file

but I have two physically different machines that are connected to one storage area, where I receive all the input files, and one database server. There are two application servers running on these machines (one on each).


So how can I use Spring Batch to process the input files on both of these application servers in parallel? I mean, if there are 10 files, can 5 be processed on server 1 (P1) and 5 on server 2 (P2)?

neel.1708
  • generated output file = write the result to the database? Or is the database only used for Spring Batch metadata, and you actually write output files back to your file system? – Cygnusx1 May 02 '13 at 13:01
  • Yes, I have to generate the output file on the file system. The DB is used to store the input file details, and after processing those details I have to generate the output file. – neel.1708 May 02 '13 at 17:02
  • If there are no dependencies between your files, I don't see why you could not do this. The only thing you have to check is that the same file is not processed by both jobs! But this would be the responsibility of the caller... How do you start your jobs? A scheduler? A ksh script? – Cygnusx1 May 02 '13 at 19:46
  • Yes, that's the main concern: avoiding processing the same file on both servers. We are planning to use a scheduler to trigger the job every 20 minutes, but this scheduler will run on both application servers, so how do we avoid processing the same file on both? Using a DB flag column, or is there a cleaner approach? – neel.1708 May 03 '13 at 05:38
  • Hi. Have you found any solution for this? – ruhungry Feb 06 '15 at 14:47
  • As Jimmy Praet answered, if the file path is unique and you use it as a job parameter, Spring Batch will take care not to execute the same job twice. This is the clean solution to follow, as you no longer care which server processes the file. – Cristian Sevescu Feb 23 '15 at 16:03
  • Yes, the file path is unique, but I have two identical application EARs running on two different machines that trigger these jobs, so I guess I will still need some locking mechanism. – neel.1708 Mar 27 '15 at 11:00

4 Answers

4

You could schedule a job per input file (input file location would be a parameter of the job). Spring Batch will guarantee no two job instances with the same job parameters are created. You'll get a JobExecutionAlreadyRunningException or JobInstanceAlreadyCompleteException if the other node has already started processing the same file.
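A minimal sketch of this approach (the class name, the injected beans, and the `input.file.path` parameter name are illustrative, not from the question; it assumes a configured `JobLauncher` and a `Job` that reads the file given by that parameter):

```java
import java.io.File;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException;

public class FilePollingScheduler {

    private JobLauncher jobLauncher;   // injected by Spring
    private Job fileProcessingJob;     // injected by Spring

    // Called by the scheduler on each server. Launches one job instance per
    // input file, with the file path as an identifying job parameter, so the
    // shared JobRepository rejects a second launch for the same file.
    public void pollAndLaunch(File inputDir) throws Exception {
        for (File file : inputDir.listFiles()) {
            JobParameters params = new JobParametersBuilder()
                    .addString("input.file.path", file.getAbsolutePath())
                    .toJobParameters();
            try {
                jobLauncher.run(fileProcessingJob, params);
            } catch (JobExecutionAlreadyRunningException e) {
                // the other server is currently processing this file
            } catch (JobInstanceAlreadyCompleteException e) {
                // this file has already been processed successfully
            }
        }
    }
}
```

Note this only works across machines because both servers point their Spring Batch metadata tables at the same shared database server.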

Jimmy Praet
  • This is the solution for this rather classical batch problem. – Cristian Sevescu Feb 23 '15 at 16:06
  • Will I get a JobExecutionAlreadyRunningException if the jobs are run from two different machines? Because my application EAR is deployed on both of these machines, and both trigger the jobs. – neel.1708 Mar 27 '15 at 10:47
1

The first thing would be to decide whether you actually want to split the files in half (5 and 5), or whether you want each server to keep processing until the work is done. If the files vary in size, with some small and others large, optimal parallelization may turn out to be 6 files on one server and 4 on the other, or 7 and 3, if those 3 take as long as the other 7 because of the difference in size.

A very rudimentary approach would be a database table that represents active processing. Your job could read the directory, grab the first file name, and insert a row into the table recording that this JVM is processing it. If the primary key of the table is the file name, then when both servers try at the same time, one insert fails and one succeeds. The server that succeeds in inserting the row wins and gets to process the file. The other has to handle the exception, pick the next file, and attempt to insert that one as a processing entry. This way each server essentially takes a centralized lock (the row in the DB table), and you get more efficient processing that accounts for file size rather than an even file split.
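A sketch of the "insert wins" idea. To keep it self-contained, the map below stands in for the database table whose primary key is the file name; in production, `claim` would be an `INSERT` that throws a duplicate-key exception on the losing node. The class and method names are illustrative, not from any library:

```java
import java.util.concurrent.ConcurrentHashMap;

public class FileClaimRegistry {
    // Stand-in for the locking table; the key plays the role of the primary key.
    private final ConcurrentHashMap<String, String> claims = new ConcurrentHashMap<>();

    // Try to claim a file for the given node. Returns true if this node won
    // the race and may process the file; false if another node got there first.
    public boolean claim(String fileName, String nodeId) {
        return claims.putIfAbsent(fileName, nodeId) == null;
    }
}
```

With a real database, the uniqueness guarantee comes from the primary-key constraint rather than `putIfAbsent`, but the control flow on each server is the same: insert, and only process the file if the insert succeeded.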

IceBox13
0

Here are my suggestions:

  • Create a locking table in the DB with the file path as the primary key. Then try to insert a record with that key: if the insert succeeds, your code can continue and process the file; if it fails (with an exception, because a record with that primary key already exists), move on to the next file.

  • Precise scheduling, as mentioned earlier by Jimmy.

  • You can try to use a message queue (like ActiveMQ, RabbitMQ, ...) to synchronize your machines.

Michal
-1

There's a pretty simple way of doing it. If I understand correctly, you put every file (or some information about it) in the database and then remove it when you create a new output. You can take a lock on the file: before reading it, you check

  for (File file : fileList.getFiles()) {
      try {
          // lock the file and process it (see below)
      } catch (IOException e) {
          // could not lock this file; move on to the next one
      }
  }

and in the processing step

     // java.nio file locking; note this only prevents concurrent access
     // if the shared file system actually enforces locks across machines
     FileChannel channel = FileChannel.open(file.toPath(), StandardOpenOption.WRITE);
     FileLock lock = channel.tryLock();
     try {
         // ... process the file ...
     } finally {
         lock.release();
         channel.close();
     }

Here is some information about java.nio.channels.FileLock.

  • 1
    Will this lock work if two JVMs are involved? Because there are two different machines involved which are not connected. – neel.1708 May 20 '13 at 06:19