We are planning to create a new processing mechanism that consists of listening to a few directories, e.g. /opt/dir1, /opt/dirN,
and, for each document created in these directories, starting a routine to process it, persist its records in a database (via REST calls to an existing CRUD API), and generate a protocol file in another directory.
For testing purposes, I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and polls for files to be processed as soon as they are created. It works, but I will most definitely run into performance issues at some point when I move to production and start receiving dozens of files to be processed in parallel, which isn't the case in my example.
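For context, the prototype is essentially the textbook `java.nio.file.WatchService` loop below, trimmed to the essentials (a single directory, inline processing); the names are illustrative:

```java
import java.nio.file.*;

public class DirectoryListener {

    public static void main(String[] args) throws Exception {
        WatchService watchService = FileSystems.getDefault().newWatchService();
        Path dir = Paths.get("/opt/dir1");
        dir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);

        WatchKey key;
        while ((key = watchService.take()) != null) {    // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dir.resolve((Path) event.context());
                process(created);                        // parse, call the CRUD API, write the protocol file
            }
            key.reset();
        }
    }

    private static void process(Path file) {
        // sequential, on the watcher thread -- this is the bottleneck I'm worried about
    }
}
```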
After some research and some tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm kind of confused about what these building blocks should be and how to assemble them to get this routine going in the simplest and most performant manner. I have a few questions regarding their added value and architecture, and would really appreciate hearing your thoughts!
I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
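What I have in mind is something like the processor below, based on the `TaskLaunchRequest` payload from spring-cloud-task-stream; the exact constructor and the wiring to a task-launcher sink are my assumptions, and the Maven coordinates are made up:

```java
import java.util.Collections;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.cloud.task.launcher.TaskLaunchRequest;
import org.springframework.integration.annotation.Transformer;

@EnableBinding(Processor.class)
public class FileToTaskLaunchRequestTransformer {

    // The file source emits the new file's path; wrap it into a
    // TaskLaunchRequest that a task-launcher sink could then execute.
    @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    public TaskLaunchRequest transform(String filePath) {
        return new TaskLaunchRequest(
                "maven://com.example:file-ingest-task:1.0.0",       // my batch task artifact (assumed coordinates)
                Collections.singletonList("filePath=" + filePath),  // pass the path as a command-line argument
                null,    // environment properties
                null,    // deployment properties
                null);   // application name
    }
}
```

My current understanding is that RabbitMQ would only act as the binder transporting messages between the file source, this processor, and the task-launcher sink; is that right?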
How can I keep some variables externalized for my task (e.g. the directory paths)? Can I have these streams and tasks read an application.yml that lives somewhere other than inside their jars? (I sketched below what I'm aiming for.)

Why should I use Spring Cloud Data Flow alongside Spring Batch and not just a plain batch application? Is it only because it spawns parallel tasks for each file, or do I get any other benefit?
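Regarding the externalized variables, this is the kind of setup I'm trying to achieve, assuming a standard Spring Boot mechanism (`ingest` prefix and property names are just examples):

```java
// Bind the directories from a config file living outside the jar, e.g. started with:
//   java -jar ingest-task.jar --spring.config.additional-location=file:/etc/ingest/
// (or --spring.config.location on Boot 1.x)
import java.util.List;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "ingest")
public class IngestProperties {

    // populated from ingest.directories in an external application.yml
    private List<String> directories;

    public List<String> getDirectories() { return directories; }
    public void setDirectories(List<String> directories) { this.directories = directories; }
}
```

But I don't know whether that approach still applies once SCDF is the one deploying the task.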
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation, considering only the sequential processing scenario, where I'd receive only one file per hour or so?
Also, if any of you have a guide or sample on how to launch a task programmatically, I would be really thankful! I am still searching for that, but it doesn't seem like I'm doing it right.
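The closest I've gotten so far is something like the snippet below using the SCDF REST client; I'm not sure this is the intended way, and the `launch` signature may differ between versions (the task definition name here is assumed):

```java
import java.net.URI;
import java.util.Collections;
import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;

public class TaskLauncherClient {

    private final DataFlowTemplate dataFlow =
            new DataFlowTemplate(URI.create("http://localhost:9393")); // SCDF server

    // Launch an already-registered task definition, passing the file path
    // as a command-line argument.
    public void launchFor(String filePath) {
        dataFlow.taskOperations().launch(
                "file-ingest-task",                                  // task definition name (assumed)
                Collections.emptyMap(),                              // deployment properties
                Collections.singletonList("filePath=" + filePath));  // command-line arguments
    }
}
```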
Thank you for your attention and any input is highly appreciated!
UPDATE
I managed to launch my task via the SCDF REST API, so I could keep my original Spring Boot app with WatchService and have it launch a new task via Feign or the like. I still know this is far from what I should be doing here. After some more research, I think creating a stream using the file source and sink would be the way to go, unless someone has another opinion; but I can't get the inbound channel adapter to poll from multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
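To frame the multi-directory problem: the only workaround I can think of is registering one inbound adapter per directory at runtime with Spring Integration's dynamic flow registration, roughly as below. The `IntegrationFlowContext` usage is my assumption, and I don't know how (or whether) this maps onto an SCDF stream app:

```java
import java.io.File;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.dsl.context.IntegrationFlowContext;
import org.springframework.integration.file.dsl.Files;

public class DirectoryFlowRegistrar {

    private final IntegrationFlowContext flowContext;

    public DirectoryFlowRegistrar(IntegrationFlowContext flowContext) {
        this.flowContext = flowContext;
    }

    // Register one polling inbound adapter per participant directory at runtime.
    public void register(String directory) {
        IntegrationFlow flow = IntegrationFlows
                .from(Files.inboundAdapter(new File(directory)),
                        e -> e.poller(Pollers.fixedDelay(1000)))
                .handle(message -> {
                    File file = (File) message.getPayload();
                    // hand off to the task launch, e.g. the DataFlowTemplate call above
                })
                .get();
        flowContext.registration(flow).register();
    }
}
```

But with thousands of directories I suspect this many pollers won't scale either, which is exactly why I'm asking.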