
I am looking for guidance in the design approach for resolving one of the problems that we have in our application.

We have scheduled jobs in our Java application and we use Quartz scheduler for it. Our application can have thousands of jobs that do the following:

  1. Scan a folder location for any new files.
  2. If there is a new file, then kick off the associated workflow to process it.

The requirement is to:

  1. Process only new files.
  2. If any duplicate file arrives (file with the same name), then don't process it.

As of now, we persist the list of processed files in the Quartz job metadata. This solution does not scale: over the years (and depending on the number of files received per day, which can be in the range of 100K), the job metadata that holds the processed-file list grows very large, and it has started causing data truncation errors (while persisting the job metadata in the Quartz table) as well as slowness.

What is the best approach for implementing this requirement and ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed-file list in an external database instead of the job metadata? If we use a single external database table for the processed files of all those thousands of jobs, the table may grow huge over the years, which doesn't look like the best approach (although proper indexing may help here).
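For illustration, this is the kind of thing we have in mind: a dedicated table keyed by job and file name, so that duplicate detection is a single indexed insert rather than loading an ever-growing list from the job metadata. The table, column, and class names below are placeholders, not our actual schema.

```java
// CREATE TABLE processed_file (
//   job_id       VARCHAR(100) NOT NULL,
//   file_name    VARCHAR(500) NOT NULL,
//   processed_at TIMESTAMP    NOT NULL,
//   PRIMARY KEY (job_id, file_name)
// );

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

public class ProcessedFileRegistry {

    private final javax.sql.DataSource dataSource;

    public ProcessedFileRegistry(javax.sql.DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /**
     * Tries to claim the file for processing. Returns true if the file was
     * not seen before; false if the primary key already exists (duplicate).
     */
    public boolean markIfNew(String jobId, String fileName) throws SQLException {
        String sql = "INSERT INTO processed_file (job_id, file_name, processed_at) VALUES (?, ?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, jobId);
            ps.setString(2, fileName);
            ps.setTimestamp(3, Timestamp.from(Instant.now()));
            ps.executeUpdate();
            return true;                 // inserted -> new file
        } catch (SQLException e) {
            // Most JDBC drivers report a unique/primary-key violation with
            // an SQLState in class 23 ("integrity constraint violation").
            if (e.getSQLState() != null && e.getSQLState().startsWith("23")) {
                return false;            // row already there -> duplicate file
            }
            throw e;
        }
    }
}
```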

Any guidance here would be appreciated. This looks like a common use case for applications that continuously process new files, so I am looking for the best possible approach to address this concern.

Aman
  • Do you have control over the process which adds files to the folder? If so, this process can notify the system of the files which were added. Another idea would be to distinguish files based on their creation timestamp. – Luke Bajada Apr 03 '17 at 17:24
  • You'll have to keep a list of files somewhere, database or some sort of file seems an appropriate place to store. Databases are efficient and it shouldn't be any sort of performance problem. – Nicholas Hirras Apr 03 '17 at 17:24

1 Answer


If not processing duplicate files is critical for you, the best way to do it would be to store the file names in a database. Keep in mind that this could be slow, since you would either be querying for each file name or issuing one large query for all the new file names.
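As a rough sketch of the "one large query" variant, assuming a table such as processed_file(job_id, file_name) (names are illustrative only), you could check the whole directory listing against the database in a single round trip:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DuplicateChecker {

    /** Returns the subset of scannedNames that is already recorded as processed. */
    public static Set<String> alreadyProcessed(Connection con,
                                               String jobId,
                                               List<String> scannedNames) throws SQLException {
        if (scannedNames.isEmpty()) {
            return Set.of();
        }
        // Build one IN (...) clause for all scanned file names.
        String placeholders = scannedNames.stream()
                .map(n -> "?")
                .collect(Collectors.joining(","));
        String sql = "SELECT file_name FROM processed_file "
                   + "WHERE job_id = ? AND file_name IN (" + placeholders + ")";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, jobId);
            for (int i = 0; i < scannedNames.size(); i++) {
                ps.setString(i + 2, scannedNames.get(i));
            }
            Set<String> seen = new HashSet<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    seen.add(rs.getString(1));
                }
            }
            return seen;
        }
    }
}
```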

That said, if you're willing to process new files which may be a duplicate, there are a number of things that can be done as an alternative:

  • Move processed files to another folder, so that the scanned folder only ever contains unprocessed files.
  • Add a custom attribute to your processed files, and process files that do not have that attribute. Be aware that this method is not supported by all file systems. See this answer for more information.
  • Keep a reference to the time when your last Quartz job started, and process only the new files created after that time (a rough sketch of this option follows the list).
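Here is a minimal sketch of that third option, assuming the job persists the instant of its previous run somewhere (for example in the Quartz job data map or a small table); the folder path and names are placeholders. Note that creation-time support depends on the underlying file system:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class NewFileScanner {

    /** Lists regular files in the folder created strictly after lastRun. */
    public static List<Path> filesCreatedAfter(Path folder, Instant lastRun) throws IOException {
        List<Path> newFiles = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder)) {
            for (Path file : stream) {
                BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
                if (attrs.isRegularFile() && attrs.creationTime().toInstant().isAfter(lastRun)) {
                    newFiles.add(file);
                }
            }
        }
        return newFiles;
    }

    public static void main(String[] args) throws IOException {
        // lastRun would normally be loaded from wherever the job persists it.
        Instant lastRun = Instant.now().minusSeconds(3600);
        for (Path p : filesCreatedAfter(Paths.get("/data/inbox"), lastRun)) {
            System.out.println("New file: " + p);
        }
    }
}
```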
Luke Bajada
  • Thanks for your answer! As you mentioned, adding a custom attribute is not supported by all file systems, so that may not be an acceptable solution for our application. Also, aren't APIs for reading file attributes such as creation time and modified time dependent on the OS and file system? – Aman Apr 03 '17 at 18:01
  • Common file attributes like those should be available across all file systems. Would the last option work for you? – Luke Bajada Apr 03 '17 at 18:03
  • No, we need the ability to check for duplicate file names. Is there any other option instead of persisting the file names in database table? – Aman Apr 03 '17 at 20:47
  • I'm assuming that a duplicate file name will overwrite the one before it since there can't be two files with the same name in the same folder. Because of that, you will need a master list of all the file names and using databases would be the best way in terms of scalability. The bright side is that if you use my third option, you will only get to compare a small subset (100k according to you) instead of all the files. – Luke Bajada Apr 03 '17 at 20:50