
I have a use case for which I could use a Spring Batch job, which I could design in one of the following ways.

1) First way:

Step 1 (chunk-oriented step): Read rows from the file —> filter, validate and transform each read row into a DTO (data transfer object); if there are any errors, store them in the DTO itself —> check whether any of the DTOs has errors; if not, write to the database. If yes, write to an error file.

However, the problem with this way is that I need the entire job inside a transaction boundary. If there is a failure in any of the chunks, I don't want to write to the DB and want to roll back all successful writes made to the DB up to that point. This way forces me to write rollback logic for all successful writes if any chunk fails.

2) Second way:

Step 1 (chunk-oriented step): Read rows from the file —> filter, validate and transform each read row into a DTO (data transfer object). This step stores any errors in the DTO object itself.

Step 2 (tasklet): Read the entire list (not chunks) of DTOs created by step 1 —> check whether any of the DTOs has errors populated in it. If yes, abort the write to the DB and fail the job.

With the second way, I get all the benefits of chunk processing and scaling. At the same time, I have created a transaction boundary for the entire job.

PS: In both ways the first step never fails; if something goes wrong, the errors are stored in the DTO object itself. Thus, a DTO object is always created.

The question is: since I am new to Spring Batch, is the second way a good pattern to follow? And is there a way to share data between steps so that the entire list of DTOs is available to the second step (in the second way above)?
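
For context, here is a rough sketch of how I imagine the second way being wired up (Java config; `FileRow`, `RowDto` and the bean names are placeholders I made up, and the part where step 2 gets hold of the complete list of DTOs is exactly what I don't know how to do yet):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class UploadJobConfig {

    @Bean
    public Job uploadJob(JobBuilderFactory jobs, Step transformStep, Step writeOrFailStep) {
        return jobs.get("uploadJob")
                .start(transformStep)   // chunk-oriented: read file -> filter/validate/transform rows into DTOs
                .next(writeOrFailStep)  // tasklet: look at the full list of DTOs, write to DB or fail the job
                .build();
    }

    @Bean
    public Step transformStep(StepBuilderFactory steps,
                              ItemReader<FileRow> reader,                 // FileRow = placeholder for a raw line
                              ItemProcessor<FileRow, RowDto> processor,   // RowDto = placeholder DTO with a List of errors
                              ItemWriter<RowDto> dtoCollectingWriter) {
        return steps.get("transformStep")
                .<FileRow, RowDto>chunk(100)
                .reader(reader)
                .processor(processor)        // errors go into the DTO itself, so this step never fails
                .writer(dtoCollectingWriter) // where should these DTOs go so that step 2 sees all of them? <- my question
                .build();
    }

    @Bean
    public Step writeOrFailStep(StepBuilderFactory steps, Tasklet writeOrFailTasklet) {
        return steps.get("writeOrFailStep")
                .tasklet(writeOrFailTasklet) // if any DTO has errors: write the error file and fail the job, else save all to DB
                .build();
    }
}
```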

Sagar
  • You shared your solution(s) but not your actual requirement. What is your actual problem? Can you explain it with an example of input/output? How would you solve it *without* Spring Batch? – Mahmoud Ben Hassine Jun 02 '21 at 07:28
  • FTR, there is no built-in way to get a transaction for an entire job, see https://stackoverflow.com/questions/19031186/spring-batch-one-transaction-over-whole-job. This is by design and not the goal in the first place, but nothing prevents you from creating a custom implementation of the `Job` interface with that (even though this would be a bad idea as it would lead to long running transactions that lock the tables for the whole duration of the job). If you clearly define your requirement, I can try to help with some guidelines about your suggestions and other alternatives as well. – Mahmoud Ben Hassine Jun 02 '21 at 08:01
  • @MahmoudBenHassine Thank you very much for looking at this issue. Well, let me try to explain the problem. The user uploads a file; I read the file and validate, filter and transform each row into a DTO. While doing that, whatever validation, filtering and transformation errors occur, I store them in that DTO only (the DTO has a List of errors in it). Once the final list of DTOs is ready, I store them in the DB only if ALL DTOs are error free; even if a single DTO has an error, I want to abort the whole operation. Currently this is a sync operation: the user uploads and waits for the result. – Sagar Jun 02 '21 at 13:09
  • I am making this whole process async. The user uploads a file, the backend then triggers a Spring Batch job, and he/she receives the result later via email. Now I have the two options explained in the question. I am leaning towards the second way BUT since I am new to Spring Batch I want to make sure it's a valid pattern. The first way above needs me to write rollback logic, which I am trying to avoid. – Sagar Jun 02 '21 at 13:15
  • There will be multiple uploads from different users; the backend will trigger a Spring Batch job for each upload as it comes in the queue. – Sagar Jun 02 '21 at 13:28
  • Also, I always need to process the entire file, so that I can report all errors in the file to the user and he can correct them in one go the next time he uploads! – Sagar Jun 02 '21 at 13:33
  • Thanks for the updates. If I understand correctly, you are trying to implement all-or-nothing semantics for the uploaded file, i.e. either all lines are valid, in which case all lines are inserted in the db, otherwise nothing should be inserted in the db and errors should be reported to the user. Is that correct? If yes, then where should the errors be persisted? In a file, another table? This is key, because you said the process is async, so you need to store errors somewhere until you send the report to the user. – Mahmoud Ben Hassine Jun 03 '21 at 07:14
  • Yes, you got it correctly. As for the errors: yes, they will be written to a file. If there are no errors, write to the DB; otherwise write the errors to a file. The error file will be stored on S3 or locally (I haven't decided where yet, though). – Sagar Jun 03 '21 at 11:07

1 Answer


In my opinion, trying to process the entire file in a single transaction (i.e. a transaction at the job level) is not the way to go. I would proceed in two steps:

  • Step 1: process the input and write errors to a file
  • Step 2: this step is conditioned by step 1. If no errors have been detected in step 1, then save the data to the db.

This approach does not require writing data to the database and rolling it back if there are errors (as suggested by option 1 in your description). It only writes to the database when everything is ok.

Moreover, this approach does not require holding a list of items in memory as suggested by option 2, which could be inefficient in terms of memory usage and perform poorly if the file is big.
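
To make the conditional execution concrete, here is a minimal sketch of how the two steps could be wired together (Java config; the `errorCount` key, the custom exit status and the bean names are illustrative assumptions, not the only way to do this):

```java
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FileImportJobConfig {

    @Bean
    public Job fileImportJob(JobBuilderFactory jobs, Step validationStep, Step dbWriteStep) {
        return jobs.get("fileImportJob")
                .start(validationStep)
                .on("COMPLETED WITH ERRORS").end() // errors were written to the error file, so skip the db step
                .from(validationStep)
                .on("*").to(dbWriteStep)           // no errors detected: save data to the db
                .end()
                .build();
    }

    // Attach this listener to validationStep (with .listener(...) on its builder). It assumes the
    // step's processor/writer increments a hypothetical "errorCount" value in the step execution
    // context whenever an invalid line is written to the error file.
    @Bean
    public StepExecutionListener errorAwareListener() {
        return new StepExecutionListener() {
            @Override
            public void beforeStep(StepExecution stepExecution) {
                // nothing to prepare before the step
            }

            @Override
            public ExitStatus afterStep(StepExecution stepExecution) {
                long errorCount = stepExecution.getExecutionContext().getLong("errorCount", 0L);
                return errorCount > 0
                        ? new ExitStatus("COMPLETED WITH ERRORS")
                        : ExitStatus.COMPLETED;
            }
        };
    }
}
```

You could also replace `.end()` in the error branch with `.fail()` if the job instance should be marked as failed when errors are found.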

Mahmoud Ben Hassine
  • Where will step 2 read the transformed items (DTOs) from? Don't they need to be in memory so that step 2 can read them and write them into the DB? Also, I understand your concern about memory inefficiency if we were to keep all items in memory, but I am going to make sure the Spring Batch job is triggered only when there is enough memory capacity on the server. As soon as capacity is available, the Spring Batch job will be triggered for the next upload in the queue. – Sagar Jun 04 '21 at 14:29
  • Also - are you suggesting that both steps in your answer be chunk-oriented? – Sagar Jun 04 '21 at 14:35
  • `don't they need to be in memory so that Step 2 can read them and write into the DB?` Not necessarily. If they don't fit in memory, they can be in an intermediate persistent storage, like a file or a temporary table. The problem with holding them in memory is that you can't know upfront how much data you will need to keep in memory (unless you do the math beforehand). Yes, both steps are chunk-oriented. The second step could be a tasklet if it copies a temporary table into a non-temporary table, if you choose that route. – Mahmoud Ben Hassine Jun 07 '21 at 12:28
  • Thank you very much for your time and response. I appreciate it! – Sagar Jun 07 '21 at 15:07