
We are planning to create a new processing mechanism that consists of listening to a few directories, e.g. /opt/dir1, /opt/dirN, and for each document created in these directories, starting a routine to process it, persist its records in a database (via REST calls to an existing CRUD API) and generate a protocol file to another directory.

For testing purposes, I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and polls for files to process as soon as they are created. It works, but I will most definitely run into performance problems once I move to production and start receiving dozens of files to be processed in parallel, which isn't the case in my example.
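
For context, this is roughly the kind of loop I have today (a minimal sketch; the directory path and the `processFile` routine are placeholders):

```java
import java.io.IOException;
import java.nio.file.*;

public class DirectoryWatcher {

    public static void main(String[] args) throws IOException, InterruptedException {
        WatchService watchService = FileSystems.getDefault().newWatchService();

        // Placeholder directory; the real app registers several of these.
        Path dir = Paths.get("/opt/dir1");
        dir.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watchService.take(); // blocks until a file-system event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dir.resolve((Path) event.context());
                processFile(created);
            }
            key.reset();
        }
    }

    // Placeholder for the actual routine: process, persist via the CRUD API,
    // and write the protocol file.
    private static void processFile(Path file) {
        System.out.println("Processing " + file);
    }
}
```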

After some research and some tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm kind of confused about what these building blocks should be and how to put them together in the simplest and most performant manner. I have a few questions regarding their added value and architecture and would really appreciate hearing your thoughts!

  • I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?

  • If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?

  • How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml from somewhere other than inside their jar?

  • Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Is it just because it spawns parallel tasks for each file, or do I get any other benefit?

  • Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation, considering only the sequential scenario, where I'd receive only one file per hour or so?

Also, if any of you have any guide or sample on how to launch a task programmatically, I would really appreciate it! I am still searching for that, but it doesn't seem I'm doing it right.

Thank you for your attention and any input is highly appreciated!

UPDATE

I managed to launch my task via the SCDF REST API, so I can keep my original Spring Boot app with WatchService and have it launch a new task via Feign or XXX. I still know this is far from what I should be doing here. After some more research, I think creating a stream using a file source and sink would be the way to go, unless someone has another opinion, but I can't get the inbound channel adapter to poll multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
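
In case it helps anyone, the REST-based launch looks roughly like this with the SCDF Java client (a sketch only: the task name and argument are made up, and the exact `launch` signature varies between SCDF versions):

```java
import java.net.URI;
import java.util.Collections;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;

public class TaskLaunchClient {

    public static void main(String[] args) {
        // Points at a locally running SCDF server; adjust to your environment.
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // "fileIngestTask" must already exist as a task definition in SCDF.
        dataFlow.taskOperations().launch(
                "fileIngestTask",
                Collections.emptyMap(),   // deployment properties
                Collections.singletonList("filePath=/opt/dir1/doc-001.txt")); // task arguments
    }
}
```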

Enrico Bergamo

1 Answer


Here are a few pointers.

I managed to create and run a sample batch file ingest task based on this section of the Spring docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?

If you have to launch it automatically upon an upstream event (e.g., a new file), yes, you could do that via a stream (see example). If the events are coming off of a message broker, you can also consume them directly in the batch job (e.g., AmqpItemReader).
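
For the broker-driven variant, the reader side can be as small as a single bean (a sketch, assuming RabbitMQ and an AmqpTemplate configured with a default receive queue; the String payload type is an assumption):

```java
import org.springframework.amqp.core.AmqpTemplate;
import org.springframework.batch.item.amqp.AmqpItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AmqpReaderConfig {

    // Reads each message (e.g., a file path published by an upstream file source)
    // from the default receive queue configured on the underlying template.
    @Bean
    public AmqpItemReader<String> amqpItemReader(AmqpTemplate amqpTemplate) {
        return new AmqpItemReader<>(amqpTemplate);
    }
}
```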

If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?

Hopefully, the above example clarifies it. If you want to programmatically launch the Task (not via DSL/REST/UI), you can do so with the new Java DSL support, which was added in 1.3.
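
For illustration, defining and deploying a stream with the 1.3-era Java DSL looks something like this (the stream definition is an assumption based on the out-of-the-box `file` source and `task-launcher-local` sink, which would need to be registered first):

```java
import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class StreamDslExample {

    public static void main(String[] args) {
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // Equivalent to the shell DSL
        // "file --directory=/opt/dir1 | task-launcher-local".
        Stream ingest = Stream.builder(dataFlow)
                .name("fileIngest")
                .definition("file --directory=/opt/dir1 | task-launcher-local")
                .create()
                .deploy();

        System.out.println("Stream status: " + ingest.getStatus());
    }
}
```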

How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml from somewhere other than inside their jar?

The recommended approach is to use Config Server. Depending on the platform where this is being orchestrated, you'd have to provide the config-server credentials to the Task and its sub-tasks, including batch jobs. In Cloud Foundry, we simply bind the config-server service instance to each of the tasks, and at runtime the externalized properties are automatically resolved.
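
Whichever property source you end up with (Config Server or an external application.yml), binding the externalized values in the task could look like this hypothetical properties class (the `ingest` prefix and field name are made up):

```java
import java.util.List;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "ingest")
public class IngestProperties {

    // Bound from e.g. "ingest.directories: /opt/dir1,/opt/dirN" in an
    // externalized application.yml or a Config Server backed property source.
    private List<String> directories;

    public List<String> getDirectories() {
        return directories;
    }

    public void setDirectories(List<String> directories) {
        this.directories = directories;
    }
}
```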

Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Is it just because it spawns parallel tasks for each file, or do I get any other benefit?

As a replacement for Spring Batch Admin, SCDF provides monitoring and management for Tasks/Batch Jobs. The executions, steps, step progress, and stack traces upon errors are persisted and available to explore from the Dashboard. You can also use SCDF's REST endpoints directly to examine this information.

Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation, considering only the sequential scenario, where I'd receive only one file per hour or so?

This is implementation specific. We do not have any benchmarks to share. However, if performance is a requirement, you could explore the remote-partitioning support in Spring Batch. You can partition the ingest or data-processing Tasks across "n" workers to achieve parallelism; see the sketch below.
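
To give a flavor of the partitioning model, here is a minimal locally partitioned step (names and file paths are made up, and the `workerStep` bean is assumed to be defined elsewhere; remote partitioning would replace the local task executor with a remote handler such as Spring Cloud Task's DeployerPartitionHandler):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionConfig {

    // Hands each partition one (hypothetical) file path via its step execution context.
    @Bean
    public Partitioner filePartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putString("filePath", "/opt/dir1/file-" + i + ".txt");
                partitions.put("partition" + i, context);
            }
            return partitions;
        };
    }

    @Bean
    public Step masterStep(StepBuilderFactory steps, Step workerStep, Partitioner filePartitioner) {
        return steps.get("masterStep")
                .partitioner("workerStep", filePartitioner)
                .step(workerStep)
                .gridSize(4)
                // Local threads here; remote partitioning dispatches workers as separate tasks.
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}
```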

Sabby Anandan
  • Hi Sabby! Thank you so much for your input. Regarding using a Stream instead of a Task, unfortunately I didn't manage to build a poller that scans multiple directories, so I'm using Tasklets for now. – Enrico Bergamo Apr 02 '18 at 11:06
  • Regardless of that, my main concern with Spring Cloud Config is: how can I ensure that devs don't have access to the git repo that holds the config for the Staging and Prod envs? Is there any other approach to assign a new properties file at runtime? I currently use `--spring.config.location=classpath:/application.yml,file:/tmp/config/application.yml` so my Dockerized applications know how to override the default properties file; that way each env has its own .yml file maintained by Operations without devs knowing its credentials. Thanks! – Enrico Bergamo Apr 02 '18 at 11:09
  • You separate the config files by Spring profiles (e.g. the name of the stage), and branches can be used for versioning. The support for this is built in. – Dragoslav Petrovic Dec 14 '21 at 11:02