I am looking for guidance on a design approach for resolving one of the problems we have in our application.
Our Java application runs scheduled jobs using the Quartz scheduler. The application can have thousands of jobs, each of which does the following:
- Scan a folder location for any new files.
- If there is a new file, then kick off the associated workflow to process it.
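The core of one scan cycle can be sketched like this. This is a simplified, hedged sketch with hypothetical class and method names (`FolderScanner`, `findNewFiles`); in the real application this logic would run inside a Quartz `execute()` method, and the duplicate check is by file name only, matching the requirement below.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Simplified core of one scan cycle: list the folder and keep only
// files whose names have not been processed before.
public class FolderScanner {

    // Returns the new file names found in `folder`, skipping any name
    // already present in `processed`. In the real job, each returned
    // name would kick off the associated workflow.
    public static List<String> findNewFiles(Path folder, Set<String> processed) throws IOException {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> !processed.contains(name))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```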
The requirement is to:
- Process only new files.
- If a duplicate file arrives (a file with the same name), don't process it.
As of now, we persist the list of processed files in the Quartz job metadata. But this solution is not scalable: over the years, and depending on the number of files received per day (which can be in the range of 100K per day), the job metadata holding the processed-file list grows very large. It has started causing data truncation errors when the job metadata is persisted to the Quartz tables, as well as slowness.
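A quick back-of-the-envelope illustration of the growth problem: Quartz serializes the job data map into a single column, so whatever structure holds the processed names grows linearly with every file ever seen. The sketch below (hypothetical names, a plain `HashSet` standing in for whatever the metadata actually stores) just measures the serialized size for a given number of file names.

```java
import java.io.*;
import java.util.*;

// Illustrates why keeping the processed-file list in job metadata cannot
// work long-term: the serialized payload grows linearly with the number
// of file names ever processed. Names and counts are illustrative only.
public class MetadataGrowth {

    // Serialized size, in bytes, of a set holding `count` file names --
    // a stand-in for what the scheduler would persist for one job.
    public static int serializedSize(int count) throws IOException {
        Set<String> processed = new HashSet<>();
        for (int i = 0; i < count; i++) {
            processed.add("input-file-" + i + ".csv");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(processed);
        }
        return bytes.size();
    }
}
```

At 100K files per day the payload reaches megabytes within days, which is consistent with the truncation errors described above.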
What is the best approach for implementing this requirement and ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed-file list in an external database instead of the job metadata? If we use a single external database table for all of those thousands of jobs, the table may grow huge over the years, which doesn't look like the best approach (although proper indexing may help here).
Any guidance here would be appreciated. This looks like a common use case for applications that continuously process new files, so I'm looking for the best possible approach to address this concern.