
I started researching Spring Batch in the last hour or two, and I'd like your input.

The problem: read one or more CSV files totaling 20 million records, perform minor processing, store the results in a database, and also write the output to another flat file, all in the least time possible.

Most important: I need to make choices that will scale horizontally in the future.

Questions:

1. Should I use remote chunking or partitioning to scale horizontally?

2. Since the data is in a flat file, are both remote chunking and partitioning bad choices?

3. Which multi-process solution would make it possible to read from a large file, spread the processing across multiple servers, and update the database, but finally write the output to a single file?

4. Does MultiResourcePartitioner work across servers?

5. Do you know of any good tutorials where something like this has been accomplished/demonstrated?

6. Your thoughts on how this should be attempted, e.g. 1) split the large file into smaller files before starting the job, 2) read one file at a time using the ItemReader, ...

PAA

2 Answers


Assuming "minor processing" isn't the bottle neck in the processing, the best option to scale this type of job is via partitioning. The job would have two steps. The first would split the large file into smaller files. To do this, I'd recommend using the SystemCommandTasklet to shell out to the OS to split the file (this is typically more performant than streaming the entire file through the JVM). An example of doing that would look something like this:

<bean id="fileSplittingTasklet" class="org.springframework.batch.core.step.tasklet.SystemCommandTasklet" scope="step">
    <property name="command" value="split -a 5 -l 10000 #{jobParameters['inputFile']} #{jobParameters['stagingDirectory']}"/>
    <property name="timeout" value="60000"/>
    <property name="workingDirectory" value="/tmp/input_temp"/>
</bean>

The second step would be a partitioned step. If the files are located in a place that is not shared, you'd use local partitioning. However, if the resulting files are on a network share somewhere, you can use remote partitioning. In either case, you'd use the MultiResourcePartitioner to generate one StepExecution per file. These would then be executed by the slaves (either running locally on threads or remotely listening to some messaging middleware).
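For illustration, here is a minimal sketch of that two-step job in Java config (the answer itself uses XML; the bean names fileSplittingTasklet and slaveStep and the grid size of 10 are assumptions, not part of the original):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Bean
public Job splitAndProcessJob(JobBuilderFactory jobs, StepBuilderFactory steps,
        Tasklet fileSplittingTasklet, Step slaveStep, Partitioner partitioner) {
    // Step 1: shell out to the OS to split the large file
    Step splitStep = steps.get("splitFile")
            .tasklet(fileSplittingTasklet)
            .build();
    // Step 2: run the slave step once per staged file
    Step masterStep = steps.get("masterStep")
            .partitioner("slaveStep", partitioner)
            .step(slaveStep)
            .gridSize(10)
            .taskExecutor(new SimpleAsyncTaskExecutor()) // local partitioning
            .build();
    return jobs.get("splitAndProcess")
            .start(splitStep)
            .next(masterStep)
            .build();
}

For remote partitioning you would replace the task executor with a PartitionHandler (e.g. MessageChannelPartitionHandler from spring-batch-integration) that ships the step execution requests to the remote slaves over your messaging middleware.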

One thing to note with this approach is that the order in which the records from the original file are processed will not be maintained.

You can see a complete remote partitioning example here: https://github.com/mminella/Spring-Batch-Talk-2.0 and a video of the talk/demo can be found here: https://www.youtube.com/watch?v=CYTj5YT7CZU

Michael Minella

I used the MultiResourcePartitioner for reading large files; this worked for me:

import java.io.IOException;
import java.net.MalformedURLException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.UrlResource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Assumes a surrounding @Configuration class with injected fields:
// stepBuilderFactory, filePath, listener, reader, writer (and a processor() bean)
@Bean
public Partitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    ResourcePatternResolver resolver =
            new PathMatchingResourcePatternResolver(getClass().getClassLoader());
    try {
        // One partition per CSV file; each partition's ExecutionContext
        // carries the file's URL under the key "fileName"
        partitioner.setResources(resolver.getResources("file:" + filePath + "/*.csv"));
    } catch (IOException e) {
        throw new IllegalStateException("Could not resolve input files", e);
    }
    // No need to call partition() here; the framework invokes it with the
    // configured grid size when the master step runs
    return partitioner;
}

@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    // Without a core size, the default of 1 plus an unbounded queue would
    // keep all partitions on a single thread
    taskExecutor.setCorePoolSize(4);
    taskExecutor.setMaxPoolSize(4);
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}

@Bean
@Qualifier("masterStep")
public Step masterStep() {
    // Runs the "processData" step once per partition, on the thread pool above
    return stepBuilderFactory.get("masterStep")
            .partitioner("processData", partitioner())
            .step(processData())
            .gridSize(10)
            .taskExecutor(taskExecutor())
            .listener(listener)
            .build();
}


@Bean
@Qualifier("processData")
public Step processData() {
    return stepBuilderFactory.get("processData")
            .<pojo, pojo>chunk(5000)
            .reader(reader)
            .processor(processor())
            .writer(writer)
            .build();
}



@Bean(name="reader")
@StepScope
public FlatFileItemReader<pojo> reader(@Value("#{stepExecutionContext['fileName']}") String filename) {

    FlatFileItemReader<pojo> reader = new FlatFileItemReader<>();
    reader.setResource(new UrlResource(filename));
    reader.setLineMapper(new DefaultLineMapper<pojo>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(FILE HEADER);


                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<pojo>() {
                {
                    setTargetType(pojo.class);
                }
            });
        }
    });
    return reader;
}   
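
The processor and writer beans referenced above are not shown. As a minimal sketch, a database writer could look like the following, assuming a DataSource bean and a hypothetical table my_table whose columns match pojo properties named field1 and field2:

import javax.sql.DataSource;

import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public JdbcBatchItemWriter<pojo> writer(DataSource dataSource) {
    JdbcBatchItemWriter<pojo> writer = new JdbcBatchItemWriter<>();
    writer.setDataSource(dataSource);
    // Named parameters are resolved against the pojo's bean properties
    writer.setSql("INSERT INTO my_table (field1, field2) VALUES (:field1, :field2)");
    writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
    return writer;
}

Since JdbcBatchItemWriter issues one batched update per chunk, the chunk size of 5000 above translates directly into 5000-row JDBC batches.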
Niraj Sonawane