
In Spring Batch, I am trying to read a CSV file and assign each row to a separate thread for processing. I tried to achieve this with a Task Executor, and it works as long as I do not obtain the file name through a job parameter. When I do get it through job parameters, all threads read the same line from the file because of the reader's scope="step". Will it be resolved if I change it to scope="job"? If yes, please suggest the way. Currently, I am getting the error below:

Caused by: java.lang.IllegalStateException: No Scope registered for scope name 'job'

Kindly help...

Find the Job.xml below

<job id="partitionJob" xmlns="http://www.springframework.org/schema/batch"        restartable="true">
    <step id="step" allow-start-if-complete="true">
        <partition step="step2" partitioner="partitioner">
            <handler grid-size="3" task-executor="taskExecutor" />
        </partition>
    </step>
</job>

    <bean id="partitioner" class="com.range.part.RangePartitioner">
</bean>

<bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor" />

<step id="step2" xmlns="http://www.springframework.org/schema/batch">
    <tasklet transaction-manager="transactionManager">
        <chunk  reader="itemReader" writer="cutomitemWriter" processor="itemProcessor" commit-interval="100" />
    </tasklet>
</step>
<bean id="itemProcessor" class="com.range.processor.UserProcessor" scope="step">
<property name="threadName" value="#{stepExecutionContext[name]}"/>
</bean>

<bean id="itemReader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="job">
 <property name="resource" value="file:#{jobParameters[file]}"> 
 </property>    
  <!-- <property name="linesToSkip" value="1"/> -->
<property name="lineMapper">
        <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
            <property name="lineTokenizer">
                <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                    <property name="delimiter" value="," />
                    <!--  <property name="names" value="transactionBranch,batchEntryDate,batchNo,channelID,CountryCode" />-->
        </bean>
            </property>
            <property name="fieldSetMapper">
                <bean class="com.fieldset.FieldsetMapper">

                </bean>
            </property>
        </bean>
    </property>
    </bean>

<bean id="cutomitemWriter" class="com.range.processor.customitemWritter">
</bean>
gautam
    The information provided is insufficient. You need to post some code. – marthursson Aug 23 '16 at 05:38
  • Looks like you are reading the file within each thread rather than outside the threads. – Vinay Prajapati Aug 23 '16 at 06:07
  • Do you want them to be read on different threads (not necessarily the most performant option) or just processed on different threads? – Michael Minella Aug 23 '16 at 17:22
  • Thanks a lot for your immediate response. Ideally I want them to be read on different threads. If I give the resource at context-setup time (by hard-coding the file name in the XML) instead of via late binding through job parameters, the reading itself happens on different threads. However, I understand from your reply that this is not the most performant option. If this is not the recommended design approach, please share the configuration/steps to process on different threads. Since I am a newbie to Spring Batch, I need your guidance and design thought process. Thanks in advance. – gautam Aug 23 '16 at 18:05
  • And the reason I wanted the reading done by different threads is that, since it is a chunk process, the reading alone cannot stay single-threaded while the rest runs on multiple threads. And if I separate the reading out as a first task, how do I pass the data to the other task so it can be processed by multiple threads? Ideally I want to read a file and assign each record to a different thread to process. – gautam Aug 24 '16 at 03:17

2 Answers

1

I'm thinking of a way in which we can use a Partitioner on top of it. At the partitioner level, we can read the file (any CSV reader, or a Spring reader, is also fine) and then process each line.

Every line will be added to the partitioner's queue (a Map), so it achieves your requirement.

I have posted the code here for your reference:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.beans.factory.annotation.Value;

public class LinePartitioner implements Partitioner {

    @Value("#{jobParameters['fileName']}")
    private String fileName;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {

        Map<String, ExecutionContext> queue = new HashMap<>();
        int count = 0;

        try (BufferedReader reader = new BufferedReader(new FileReader(this.fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {

                ExecutionContext value = new ExecutionContext();
                value.put("lineContent", line);
                value.put("lineCount", count + 1);

                // one entry per line, so every line becomes its own partition/thread
                queue.put("line" + (++count), value);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Unable to read " + this.fileName, e);
        }

        return queue;
    }
}

As in the code above, you can replace the reader with any CSV reader or Spring reader to simplify mapping the fields to a POJO object.
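
For illustration, here is a minimal sketch (class and property names are hypothetical) of a reader for the worker step that emits the single line the partitioner stored under "lineContent"; it would be declared with scope="step" and its property bound with #{stepExecutionContext['lineContent']}, the same way the itemProcessor bean in the question binds #{stepExecutionContext[name]}.

import org.springframework.batch.item.ItemReader;

// Hypothetical worker-step reader: returns the one line assigned to this
// partition, then null to signal that the partition's data is exhausted.
public class SingleLineItemReader implements ItemReader<String> {

    private String lineContent;   // bound via #{stepExecutionContext['lineContent']}
    private boolean consumed = false;

    public void setLineContent(String lineContent) {
        this.lineContent = lineContent;
    }

    @Override
    public String read() {
        if (consumed) {
            return null;          // end of this partition
        }
        consumed = true;
        return lineContent;
    }
}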

Please let me know if you need the full program; I will write it and upload it for you.

Thanks, Nghia

-- Update with an example that builds the Partitioner with 1000-item trunks for the reader:

@Override
public Map<String, ExecutionContext> partition(int gridSize) {
    try {
        Map<String, ExecutionContext> queue = new HashMap<>();

        List<List<String>> trunks = new ArrayList<>();

        // read the file and store its lines in trunks of 1000 items each
        int chunkSize = 1000;
        int count = 0;
        try (BufferedReader br = new BufferedReader(new FileReader("your file"))) {
            String line;
            List<String> items = null;
            while ((line = br.readLine()) != null) {
                // start a new trunk every chunkSize lines
                if (count % chunkSize == 0) {
                    items = new ArrayList<>();
                    trunks.add(items);
                }
                items.add(line);
                count++;
            }
        }

        // add the trunks to the queue; each entry becomes one partition to process
        for (int i = 0; i < trunks.size(); i++) {
            ExecutionContext value = new ExecutionContext();
            value.put("items", trunks.get(i));
            queue.put("trunk" + i, value);
        }

        return queue;
    } catch (IOException e) {
        throw new IllegalStateException("Unable to read the input file", e);
    }
}
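
To consume those trunks in the worker step, the reader can iterate over the "items" list carried by each ExecutionContext. Below is a minimal sketch, assuming a step-scoped bean whose items property is bound with #{stepExecutionContext['items']} (class and property names are hypothetical):

import java.util.List;

import org.springframework.batch.item.ItemReader;

// Hypothetical worker-step reader: hands out, one by one, the lines of the
// trunk that the partitioner stored under "items" for this partition.
public class TrunkItemReader implements ItemReader<String> {

    private List<String> items;   // bound via #{stepExecutionContext['items']}
    private int index = 0;

    public void setItems(List<String> items) {
        this.items = items;
    }

    @Override
    public String read() {
        if (items == null || index >= items.size()) {
            return null;          // end of this partition's trunk
        }
        return items.get(index++);
    }
}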
Nghia Do
  • Hi, thank you so much. I have started implementing your approach. Once it is done I will let you know if I need any help. I have also tried another approach. – gautam Aug 25 '16 at 04:57
  • I have split the original file into multiple files and, using MultiFileResourcePartitioner, assigned an individual file to each thread. Kindly confirm which approach will give more performance benefit. Thanks in advance. – gautam Aug 25 '16 at 05:00
  • In your case, since the file reading is taken care of in the partitioner class itself, can I directly proceed to process the records coming through the ExecutionContext? Another separate reader is not required, right? – gautam Aug 25 '16 at 05:03
  • For your question, it depends on a couple of factors that drive the design, such as: 1. How big is your file? 2. What is the acceptable performance for your system? If your file is not a huge one, you don't need different readers for it. – Nghia Do Aug 25 '16 at 11:20
  • It will be a very huge file; we will get more than 50,000 transactions in one file. – gautam Aug 25 '16 at 12:08
  • I got stuck with your approach. I need to split the records per thread using the grid size; how do I allocate them? Consider that each thread should pick up 500 records from the execution context. – gautam Aug 25 '16 at 12:11
  • I don't have your full requirements, such as what the processor and writer do and how strong your system is (RAM, CPU, ...). And remember, threading doesn't necessarily speed up the performance of your program. With the information you gave, I would go with a commit-interval of 5000. – Nghia Do Aug 25 '16 at 12:37
  • I need to fill the records per thread in the partitioner class. Can you tell me how to fill the records within the limit of the commit interval in the partitioner class? – gautam Aug 25 '16 at 14:24
  • The commit interval is for the writer, as below. – Nghia Do Aug 25 '16 at 16:00
  • Hi, can you please help me write a partitioner class that assigns 1000 records per thread? – gautam Aug 26 '16 at 04:48
  • Don't worry about 50,000 lines of transactions; Java can handle it with a BufferedReader. Spring Batch, including the Partitioner, Reader, Processor, and Writer, is flexible enough to cover your requirement. You can use the Partitioner as the root source: in the Partitioner, use a BufferedReader to load all the lines into a list, construct 1000-item trunks, and start processing each of them. – Nghia Do Aug 27 '16 at 17:13
  • Thank you so much, Nghia Do. It is working. Thanks for your help. – gautam Sep 01 '16 at 03:46
  • That's great. Could you mark it as the answer so it can be closed? – Nghia Do Sep 01 '16 at 04:52
0

You can see this example (on GitHub) of a multi-threaded job that imports a big CSV file (around 200,000 lines) into a DB and exports it from the DB to a JSON file (the FileReader and FileWriter are not thread-safe).

<batch:job id="transformJob">
    <batch:step id="deleteDir" next="cleanDB">
        <batch:tasklet ref="fileDeletingTasklet" />
    </batch:step>
    <batch:step id="cleanDB" next="countThread">
        <batch:tasklet ref="cleanDBTasklet" />
    </batch:step>
    <batch:step id="countThread" next="split">
        <batch:tasklet ref="countThreadTasklet" />
    </batch:step>
    <batch:step id="split" next="partitionerMasterImporter">
        <batch:tasklet>
            <batch:chunk reader="largeCSVReader" writer="smallCSVWriter"
                commit-interval="#{jobExecutionContext['chunk.count']}" />
        </batch:tasklet>
    </batch:step>
    <batch:step id="partitionerMasterImporter" next="partitionerMasterExporter">
        <partition step="importChunked" partitioner="filePartitioner">
            <handler grid-size="10" task-executor="taskExecutor" />
        </partition>
    </batch:step>
    <batch:step id="partitionerMasterExporter" next="concat">
        <partition step="exportChunked" partitioner="dbPartitioner">
            <handler grid-size="10" task-executor="taskExecutor" />
        </partition>
    </batch:step>
    <batch:step id="concat">
        <batch:tasklet ref="concatFileTasklet" />
    </batch:step>
</batch:job>

<batch:step id="importChunked">
    <batch:tasklet>
        <batch:chunk reader="smallCSVFileReader" writer="dbWriter"
            processor="importProcessor" commit-interval="500">
        </batch:chunk>
    </batch:tasklet>
</batch:step>

<batch:step id="exportChunked">
    <batch:tasklet>
        <batch:chunk reader="dbReader" writer="jsonFileWriter"
            processor="exportProcessor" commit-interval="#{jobExecutionContext['chunk.count']}">
        </batch:chunk>
    </batch:tasklet>
</batch:step>

<bean id="jsonFileWriter" class="com.batch.writer.PersonWriterToFile"
    scope="step">
    <property name="outputPath" value="csv/chunked/paged-#{stepExecutionContext[page]}.json" />
</bean>

<bean id="dbReader" class="com.batch.reader.PersonReaderFromDataBase" scope="step">
    <property name="iPersonRepository" ref="IPersonRepository" />
    <property name="page" value="#{stepExecutionContext[page]}"/>
    <property name="size" value="#{stepExecutionContext[size]}"/>
</bean>

<bean id="countThreadTasklet" class="com.batch.tasklet.CountingTasklet"
    scope="step">
    <property name="input" value="file:csv/input/#{jobParameters[filename]}" />
</bean>

<bean id="cleanDBTasklet" class="com.batch.tasklet.CleanDBTasklet" />

<bean id="fileDeletingTasklet" class="com.batch.tasklet.FileDeletingTasklet">
    <property name="directory" value="file:csv/chunked/" />
</bean>

<bean id="concatFileTasklet" class="com.batch.tasklet.FileConcatTasklet">
    <property name="directory" value="file:csv/chunked/" />
    <property name="outputFilename" value="csv/output/export.json" />
</bean>

<bean id="filePartitioner" class="com.batch.partitioner.FilePartitioner">
    <property name="outputPath" value="csv/chunked/" />
</bean>

<bean id="dbPartitioner" class="com.batch.partitioner.DBPartitioner" scope="step">
    <property name="pageSize" value="#{jobExecutionContext['chunk.count']}" />
</bean>

<bean id="largeCSVReader" class="com.batch.reader.LineReaderFromFile"
    scope="step">
    <property name="inputPath" value="csv/input/#{jobParameters[filename]}" />
</bean>

<bean id="smallCSVWriter" class="com.batch.writer.LineWriterToFile"
    scope="step">
    <property name="outputPath" value="csv/chunked/"></property>
</bean>

<bean id="smallCSVFileReader" class="com.batch.reader.PersonReaderFromFile"
    scope="step">
    <constructor-arg value="csv/chunked/#{stepExecutionContext[file]}" />
</bean>

<bean id="importProcessor" class="com.batch.processor.ImportPersonItemProcessor" />

<bean id="exportProcessor" class="com.batch.processor.ExportPersonItemProcessor" />

<bean id="dbWriter" class="com.batch.writer.PersonWriterToDataBase">
    <property name="iPersonRepository" ref="IPersonRepository" />
</bean>

In both cases, a partitioner is used to split the data into 10 files (one file per thread) for the import and to export into 10 files (one file per thread as well); then we concatenate them all to get a single file.
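
For reference, here is a minimal sketch of what such a file-per-thread partitioner could look like; it is an assumption based on the configuration above, not the actual com.batch.partitioner.FilePartitioner from the linked example. It creates one ExecutionContext per previously split chunk file and exposes the file name under the "file" key that smallCSVFileReader binds with #{stepExecutionContext[file]}.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical sketch of a file-per-thread partitioner: one partition per chunk file.
public class ChunkFilePartitioner implements Partitioner {

    private String outputPath;    // directory that holds the split files, e.g. csv/chunked/

    public void setOutputPath(String outputPath) {
        this.outputPath = outputPath;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        File[] chunkFiles = new File(outputPath).listFiles();
        if (chunkFiles != null) {
            for (int i = 0; i < chunkFiles.length; i++) {
                ExecutionContext context = new ExecutionContext();
                // the worker step's reader resolves this key via late binding
                context.put("file", chunkFiles[i].getName());
                partitions.put("partition" + i, context);
            }
        }
        return partitions;
    }
}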

Hope this helps.

M. Mohamed