I have a system that reads large files from an FTP server, stores them in a database, and sends them to an API that processes the data. I have to chunk the data because a file may contain hundreds of thousands of records and the processing takes a while. The processing is done in jobs, so I batch those jobs to know when a file is done and then continue with the next one. Now I've been asked to automate the check for new files, so that a scheduler does the checking and starts the long process. I wrote the task and scheduled it to run every 5 minutes, but the previous run takes longer than that, so I know the scheduler won't wait for the first run to finish. I thought withoutOverlapping would prevent that from happening, but it didn't work, and I don't know if there's another way to achieve this.
1 Answer
It sounds like you are using a task scheduler to check for new files and start the data processing jobs. However, you are running into issues where the previous job may still be running when the scheduler starts the next job, leading to overlapping and potential data processing errors.
One solution to this issue would be to use a lock file to prevent the scheduler from starting a new run while the previous one is still in progress. When processing starts, it acquires a lock on a specific file and releases it only once the whole file has been processed. The scheduler checks whether the lock is still in place before starting a new run. If it is, the scheduler skips that run and tries again on its next tick, so the next run starts only after the lock has been released.
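The lock-file idea can be sketched as follows. This is a minimal illustration in Python rather than the asker's framework, and the names run_exclusive, lock_path, and task are hypothetical. It assumes the lock file lives on a local filesystem where exclusive creation is atomic.

```python
import os


def run_exclusive(lock_path, task):
    """Run task() only if no other run holds lock_path.

    Returns True if the task ran, False if it was skipped.
    """
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # a previous run is still in progress, skip this tick
    try:
        # Record the owner's PID so a stale lock can be diagnosed by hand.
        os.write(fd, str(os.getpid()).encode())
        task()
        return True
    finally:
        # Release the lock even if task() raises.
        os.close(fd)
        os.remove(lock_path)
```

Two caveats on this sketch: a process that is killed outright leaves a stale lock file behind, so some expiry or manual cleanup is usually needed, and the lock only helps if it is held until the long processing is actually finished. If the scheduled task merely dispatches background jobs and returns, the lock would have to be released by whatever runs when the batch completes, not by the scheduled task itself.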
Another solution could be to use a queue system to manage the data processing jobs. Instead of starting the jobs directly from the scheduler, you could enqueue them into a queue system, such as RabbitMQ or Apache Kafka. The processing jobs would then be picked up by worker processes, which can be scaled up or down as needed to handle the volume of incoming jobs. If a single worker consumes the queue, files are processed one at a time in the order they were received, so runs cannot overlap. With several workers you gain throughput but can no longer rely on strict ordering.
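As a rough illustration of the queue approach, the sketch below uses Python's in-process queue.Queue and one worker thread as a stand-in for a real broker such as RabbitMQ. The names start_worker, enqueue_new_files, seen, and listing are hypothetical, and the set of already-seen file names is kept in memory only for the sake of the example.

```python
import queue
import threading


def start_worker(jobs, process):
    """Start a single worker thread that handles queued files one at a time."""
    def loop():
        while True:
            name = jobs.get()
            try:
                if name is None:  # sentinel value: stop the worker
                    break
                process(name)     # the long-running work happens here
            finally:
                jobs.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker


def enqueue_new_files(jobs, seen, listing):
    """Scheduler tick: enqueue files not seen before; never process inline."""
    for name in listing:
        if name not in seen:
            seen.add(name)
            jobs.put(name)
```

The scheduler tick stays cheap here because it only lists files and enqueues the new ones. Since there is exactly one consumer, a slow file simply delays the files behind it instead of overlapping with them.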
Overall, using either a file lock or a queue system can help you automate the process of checking for new files and starting the data processing jobs, while also ensuring that the jobs are executed in a controlled and safe manner.

Thanks! I saw this late but yeah, I used a lock file to avoid starting the whole process again and it seems to work, still in testing but not causing trouble so far. – daxez Feb 28 '23 at 02:32