
We run Spring Cloud Data Flow (SCDF) in Kubernetes to orchestrate Spring Batch jobs. For each new file that arrives, Spring Cloud Data Flow spins up a new Spring Batch task.

Each Spring Batch task accesses the database through its own connection pool, which holds (by default) 10 connections. That limits the number of jobs we can run at the same time, which goes against scalability principles. The only solutions we've found so far are:

  • Reduce the Spring Batch connection pool size: we cannot reduce it much because our jobs are multi-threaded.
  • Increase the max number of connections in the database: it does not scale.
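
To make the limit concrete, here is a rough sketch of the arithmetic, not a measurement from our cluster. It assumes every running job holds its full pool for its whole lifetime. The class and method names are hypothetical, and the figures in `main` (100 database connections, 10 kept back for other clients) are illustrative assumptions only.

```java
public final class PoolCapacity {
    private PoolCapacity() {}

    /**
     * Upper bound on simultaneously running jobs when each job keeps
     * poolSizePerJob connections open for its whole lifetime.
     */
    public static int maxConcurrentJobs(int dbMaxConnections,
                                        int reservedConnections,
                                        int poolSizePerJob) {
        if (poolSizePerJob <= 0) {
            throw new IllegalArgumentException("poolSizePerJob must be positive");
        }
        // Connections left for batch jobs after other clients take their share.
        int usable = Math.max(0, dbMaxConnections - reservedConnections);
        return usable / poolSizePerJob;
    }

    public static void main(String[] args) {
        // Illustrative numbers only (assumptions, not our real settings).
        System.out.println(maxConcurrentJobs(100, 10, 10)); // prints 9
        System.out.println(maxConcurrentJobs(100, 10, 4));  // prints 22
    }
}
```

Both options in the list only move one of the inputs: a smaller pool raises the bound, and a larger database limit raises it too, but the bound itself remains.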

We were wondering whether there is any way of delegating the interaction with the Spring Batch database tables to Spring Cloud Data Flow through an API.

Thanks.

alvgarvilla
  • That's not specific to Spring Batch or SCDF per se; it is true for every design where a central database is shared between different processes. Shared state is the most limiting factor for scalability. That said, you should be able to scale your batch infrastructure even with a shared database if you configure it correctly (finding the best pool size is an empirical process: try and see which value fits your needs). What do you mean by "delegating the interaction of the Spring Batch database tables to Spring Cloud Data Flow through API"? – Mahmoud Ben Hassine Jun 22 '20 at 11:34
  • @MahmoudBenHassine: Let me help the OP add some context. We are trying to build a k8s system where a "dynamic pod" can be spun up by SCDF, perform the job, then get killed. Each of the pods is a Spring Batch job. The point of the pods is to have multiple lightweight batch processes importing data. However, each pod now have their own DB connection pools to write the "Spring Batch" tables, hence the PostgreSQL connections are used up quickly if multiple pods run at the same time. What the OP is trying to achieve is to make SCDF handle these DB connections instead of the individual pods. – Hoàng Long Jun 22 '20 at 14:48
  • It's not SCDF's concern to manage this. As far as SCDF is concerned, it launches a task which will be scheduled on a given node of the cluster. Now if this task requires a connection to the database, it should handle it itself. `each pod now have their own DB connection pools`: why does each pod need an entire pool? If you run each job in a pod, a single connection from your central db pool is enough. If you really need an entire pool per job (like for multi-threading), you really need to increase your central pool to accept that many connections, but again this is inherent to the design. – Mahmoud Ben Hassine Jun 22 '20 at 15:22
  • Thanks for the response @MahmoudBenHassine. We use multi-threading, so we need the entire pool for processing. We can increase our central pool, but that might result in scalability issues and limit the number of jobs that we can run simultaneously. Are you aware of the interface that Spring Batch uses to access the repository layer? We might be able to create our own implementation of it that calls another service, which would handle the database interactions. – alvgarvilla Jun 22 '20 at 17:24
  • `that might result in scalability issues`: I would try and see, as it is possible that it might **not** result in scalability issues. I leave it to you, but I would not worry about this problem until it really happens. – Mahmoud Ben Hassine Jun 23 '20 at 07:36
  • Hi @MahmoudBenHassine, we are already hitting those issues with the current PostgreSQL max number of connections. We are increasing it, but based on our estimates we will still have scalability issues. It will work for now, but I'm not confident that having every Spring Batch job hold a number of connections is suitable for a scalable architecture. – alvgarvilla Jun 26 '20 at 07:35
  • OK, it's up to you. Again, back to my first comment, that's not specific to Spring Batch or SCDF. Remove Spring Batch from the picture and replace it with plain Java code or anything else that uses multiple threads each requiring a db connection and you will have the same problem. Anyway, I tried to help by giving the current state of things. Good luck! – Mahmoud Ben Hassine Jun 26 '20 at 07:49
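
For readers weighing the "try and see which value fits your needs" advice from the comments, the following is a minimal plain-Java sketch, with no Spring Batch involved, of worker threads borrowing from a pool that is smaller than the thread count. It assumes each thread needs a connection only for short calls rather than for its whole lifetime, which may not hold for every job. A `Semaphore` stands in for the connection pool, `Thread.sleep` stands in for a database call, and all names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public final class BoundedPoolDemo {
    private BoundedPoolDemo() {}

    /**
     * Runs `tasks` units of work on `threads` workers that share `poolSize`
     * permits, and returns the highest number of permits held at once.
     */
    public static int peakConnectionsInUse(int threads, int poolSize, int tasks) {
        Semaphore pool = new Semaphore(poolSize, true); // stand-in for a connection pool
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < tasks; i++) {
            workers.submit(() -> {
                try {
                    pool.acquire(); // blocks while all "connections" are borrowed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inUse.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5); // stand-in for a short database call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inUse.decrementAndGet();
                    pool.release(); // hand the "connection" back
                }
            });
        }

        workers.shutdown();
        try {
            workers.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 8 worker threads sharing 3 permits: the peak never exceeds 3.
        System.out.println(peakConnectionsInUse(8, 3, 40));
    }
}
```

The trade-off is that threads wait when the pool is exhausted, so a smaller pool costs throughput rather than correctness. Whether that cost is acceptable is the empirical question raised in the comments.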

0 Answers