
I need to implement a batch job that splits an XML file, processes the parts, and aggregates them afterwards. The aggregate then needs to be processed further.

Processing the account parts is really expensive. As the image below shows, the processing part runs for each account of a person (the number of accounts per person varies, by the way).

[Image: structure of the XML]

Sample input file:

<?xml version="1.0" encoding="UTF-8"?>
<Persons>
    <Person>
        <Name>Max Mustermann</Name>
        <Accounts>
            <Account>maxmustermann</Account>
            <Account>esel</Account>
            <Account>affe</Account>
        </Accounts>
    </Person>
    <Person>
        <Name>Petra Pux</Name>
        <Accounts>
            <Account>petty</Account>
            <Account>petra</Account>
        </Accounts>
    </Person>
    <Person>
        <Name>Einsiedler Bob</Name>
        <Accounts>
            <Account>bob</Account>
        </Accounts>
    </Person>
</Persons>

For each Account of each Person, do the following: invoke a REST service, for instance

GET /account/{person}/{account}/logins

As a result, invoke a REST service for each Person with the aggregated logins XML:

POST /analysis/logins/{person}

<Person>
    <Name>Max Mustermann</Name>
    <Accounts>
        <Account>
            <LoginCount>22</LoginCount>
            <Name>maxmustermann</Name>
        </Account>
        <Account>
            <LoginCount>42</LoginCount>
            <Name>esel</Name>
        </Account>
        <Account>
            <LoginCount>13</LoginCount>
            <Name>affe</Name>
        </Account>
    </Accounts>
</Person>

I don't have any influence on the APIs, so I need to update person bundles.

How could I realize the parallelization, and how should I structure my Spring Batch application accordingly?

I've found some starting points, but none of them was really satisfying.

Should I process the account data in one step, returning an item for every account, and then poll for each item in a next step and aggregate them there? Or should I implement the parallelization within the account step, aggregate inside that step, and introduce a next step for further processing?

Problem with the first approach: how do I know that all items have arrived, so that I can start the aggregation?

Problem with (or rather a question about) the second approach: is it common to implement the parallelization manually (for instance with Java Futures) instead of leaving it to Spring Batch?

What's the Spring Batch way of doing this?

Thanks, Thomas

1 Answer


You have not explained many things in your question, e.g. what your starting points were and why they were not satisfying.

Also, you should show a sample input and output XML. Your last paragraph is quite cryptic to read and understand, so please reword it.

Having said that, I guess that in terms of Java you have an input of type List<Person>, where Person looks something like this:

public class Person {
    private List<Account> accounts;
    // ...
}
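
and, judging from your desired output XML, Account would roughly be the following (this is just my assumption about your model, not your actual class):

public class Account {
    private String name;
    private int loginCount;
    // getters and setters ...
}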

I still don't know your desired output, and I'm not sure what you mean by aggregate.

Please explain these points and I will revise my answer.

As far as I understand the aggregation requirement, Spring Batch already writes in chunks, so that part should be taken care of without any explicit aggregation.
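
To illustrate what I mean, a minimal chunk-oriented step could look like this inside your batch @Configuration class (the bean names and the chunk size of 10 are just placeholders):

@Bean
public Step processPersonsStep(StepBuilderFactory steps,
                               ItemReader<Person> personReader,
                               ItemProcessor<Person, Person> personProcessor,
                               ItemWriter<Person> personWriter) {
    // Spring Batch collects processed items into chunks of 10
    // and hands each chunk to the writer in one go
    return steps.get("processPersonsStep")
            .<Person, Person>chunk(10)
            .reader(personReader)
            .processor(personProcessor)
            .writer(personWriter)
            .build();
}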

If I have understood your input structure and processing need correctly, you should utilize Spring Batch partitioning. You can either partition your List<Person> (the records read from the XML), partition the XML directly (using the split command and a SystemCommandTasklet), or divide your XML into multiple smaller XMLs and use a MultiResourcePartitioner.
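
For the MultiResourcePartitioner variant, a rough sketch could be the following (assuming you have already split the XML into smaller files beforehand; the file pattern, bean names and the slave step are placeholders):

@Bean
public Step masterStep(StepBuilderFactory steps, Step slaveStep,
                       TaskExecutor taskExecutor) throws IOException {
    // one partition (and therefore one slave step execution) per split file
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/tmp/persons/part-*.xml"));
    return steps.get("masterStep")
            .partitioner("slaveStep", partitioner)
            .step(slaveStep)
            .taskExecutor(taskExecutor)
            .build();
}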

It's not necessary to start all partitioned steps in one go; you can run a small group of partitioned steps in parallel by using the concurrencyLimit of the TaskExecutor.
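
For example (the limit of 4 and the thread name prefix are arbitrary):

@Bean
public TaskExecutor taskExecutor() {
    // at most 4 partitioned slave steps run at the same time
    SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("partition-");
    executor.setConcurrencyLimit(4);
    return executor;
}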

Various strategies can be worked out for writing to a single file from multiple threads, or for writing to multiple files from multiple threads, etc.

The standard Spring Batch way would be to read x persons in y parallel threads (i.e. x persons per thread), process each person in parallel, and then write in chunks to a single target or to multiple targets.
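
If you prefer partitioning the already-read List<Person> rather than files, a hypothetical Partitioner could hand out index ranges like this (the slave reader would then pick up fromIndex/toIndex from the step execution context; class and key names are my own):

public class PersonRangePartitioner implements Partitioner {

    private final int totalPersons;

    public PersonRangePartitioner(int totalPersons) {
        this.totalPersons = totalPersons;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int rangeSize = (totalPersons + gridSize - 1) / gridSize;
        for (int i = 0; i < gridSize; i++) {
            // each slave step gets its own from/to slice of the person list
            ExecutionContext context = new ExecutionContext();
            context.putInt("fromIndex", i * rangeSize);
            context.putInt("toIndex", Math.min((i + 1) * rangeSize, totalPersons));
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}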

Divide the total persons into N partitions, invoke the REST services in an ItemProcessor for each person, and prepare the target Person object in that processor. The number of output persons handed to the writer at a time is controlled by the chunk-size setting. Mark the slave step components ItemReader, ItemProcessor and ItemWriter as @StepScope; each slave step will then run in its own thread.
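
A rough sketch of the processor and writer could look like this (the URLs mirror your endpoints; the base URL, the Logins DTO, the Person/Account constructors and the RestTemplate wiring are my assumptions):

@Bean
@StepScope
public ItemProcessor<Person, Person> personProcessor(RestTemplate restTemplate) {
    return person -> {
        List<Account> aggregated = new ArrayList<>();
        for (Account account : person.getAccounts()) {
            // GET /account/{person}/{account}/logins for every account
            Logins logins = restTemplate.getForObject(
                    "http://example.org/account/{person}/{account}/logins",
                    Logins.class, person.getName(), account.getName());
            aggregated.add(new Account(account.getName(), logins.getLoginCount()));
        }
        // the aggregated target Person is what goes to the writer
        return new Person(person.getName(), aggregated);
    };
}

@Bean
@StepScope
public ItemWriter<Person> personWriter(RestTemplate restTemplate) {
    return persons -> {
        for (Person person : persons) {
            // POST /analysis/logins/{person} with the aggregated person
            restTemplate.postForEntity(
                    "http://example.org/analysis/logins/{person}",
                    person, Void.class, person.getName());
        }
    };
}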

Hope it helps and let me know what specific challenges you face.
