
We have a large CSV file with 100 million records. We use Spring Batch to load it into the database, first splitting the file into chunks of 1 million records each with a `SystemCommandTasklet`, then reading and writing each chunk. Below is a snippet:

@Bean
@StepScope
public SystemCommandTasklet splitFileTasklet(@Value("#{jobParameters[filePath]}") final String inputFilePath) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();

    // Rename the input file with a "processing" prefix before splitting
    final File file = BatchUtilities.prefixFile(inputFilePath, AppConstants.PROCESSING_PREFIX);

    // Assemble: "<split-command> <input-file> <output-prefix>", where the output
    // prefix is the input location plus a second-resolution timestamp
    final String command = configProperties.getBatch().getDataLoadPrep().getSplitCommand()
            + " " + file.getAbsolutePath()
            + " " + configProperties.getBatch().getDataLoad().getInputLocation()
            + System.currentTimeMillis() / 1000;
    tasklet.setCommand(command);
    tasklet.setTimeout(configProperties.getBatch().getDataLoadPrep().getSplitCommandTimeout());

    // Make the renamed file path available to later steps
    executionContext.put(AppConstants.FILE_PATH_PARAM, file.getPath());

    return tasklet;
}

and the batch config:

batch:
  data-load-prep:
    input-location: /mnt/mlr/prep/
    split-command: split -l 1000000 --additional-suffix=.csv       
    split-command-timeout: 900000 # 15 min
    schedule: "*/60 * * * * *"
    lock-at-most: 5m
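
For illustration, the tasklet above ends up running a command roughly like the following (the timestamp and resolved paths are illustrative, assuming the data-load input location also resolves to /mnt/mlr/prep/), so the split files come out named like 1654000000aa.csv, 1654000000ab.csv, and so on:

split -l 1000000 --additional-suffix=.csv /mnt/mlr/prep/PROCESSING_data.csv /mnt/mlr/prep/1654000000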

With the above config, I am able to read, load, and write to the database successfully. However, I found a bug in the snippet below: after splitting the file, only the first split file has a header; the subsequent split files do not have a header on the first line. So I have to either disable or avoid the `linesToSkip(1)` config for the `FlatFileItemReader` (CSV reader).

@Configuration
public class DataLoadReader {

    @Bean
    @StepScope
    public FlatFileItemReader<DemographicData> demographicDataCSVReader(@Value("#{jobExecutionContext[filePath]}") final String filePath) {
        return new FlatFileItemReaderBuilder<DemographicData>()
                .name("data-load-csv-reader")
                .resource(new FileSystemResource(filePath))
                .linesToSkip(1) // Need to avoid this from the 2nd split file onwards, as split files do not have a header
                .lineMapper(lineMapper())
                .build();
    }

    public LineMapper<DemographicData> lineMapper() {
        DefaultLineMapper<DemographicData> defaultLineMapper = new DefaultLineMapper<>();
        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();

        lineTokenizer.setNames("id", "mdl65DecileNum", "mdl66DecileNum", "hhId", "dob", "firstName", "middleName",
                "lastName", "addressLine1", "addressLine2", "cityName", "stdCode", "zipCode", "zipp4Code", "fipsCntyCd",
                "fipsStCd", "langName", "regionName", "fipsCntyName", "estimatedIncome");

        defaultLineMapper.setLineTokenizer(lineTokenizer);
        defaultLineMapper.setFieldSetMapper(new DemographicDataFieldSetMapper());
        return defaultLineMapper;
    }
}

Note: the loader should not skip the first row of the second and subsequent files while loading.

Thank you in advance. I appreciate any suggestions.

Bheeresh

1 Answer


I would do it in the SystemCommandTasklet with the following command:

tail -n +2 data.csv | split -l 1000000 --additional-suffix=.csv
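
One caveat worth noting here: `SystemCommandTasklet` hands the command to the JVM's `Runtime.exec`, which does not interpret shell operators such as the pipe, so a pipeline like the one above generally needs to be run through a shell. A minimal sketch, assuming Spring Batch 5, where `setCommand` accepts the command and its arguments as separate strings (on 4.x, where `setCommand` takes a single string, a common workaround is to put the pipeline in a small shell script and invoke that script); `file` and `outputPrefix` are placeholders for your own values:

SystemCommandTasklet tasklet = new SystemCommandTasklet();
// "/bin/sh -c" lets the shell interpret the pipe; the lone "-" tells split
// to read from stdin so an output prefix can still be supplied.
tasklet.setCommand("/bin/sh", "-c",
        "tail -n +2 " + file.getAbsolutePath()
                + " | split -l 1000000 --additional-suffix=.csv - " + outputPrefix);
tasklet.setTimeout(900000); // 15 min, matching the question's config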

If you really want to do it with Java in your Spring Batch job, you can use a custom reader or an item processor that filters out the header. But I would not recommend this approach, as it introduces an additional test for each item (given the large number of lines in your input file, this could impact the performance of your job).
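
For completeness, here is a minimal sketch of the item-processor variant; returning `null` from an `ItemProcessor` filters the item out of the chunk. It assumes `DemographicData` has a `getId()` accessor and that `DemographicDataFieldSetMapper` can map the header tokens without throwing (both assumptions; adapt to your model):

// Hypothetical filter: drop any row whose "id" column still holds the header text.
@Bean
public ItemProcessor<DemographicData, DemographicData> headerFilteringProcessor() {
    return item -> "id".equals(item.getId()) ? null : item;
}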

Mahmoud Ben Hassine
  • `I would do it in the SystemCommandTasklet with the following command:` Could you explain how this fixes the problem? Like you said, having a filter to verify and skip the headers eats up system resources, so I'm looking for a fix that does not kill the performance. – Bheeresh May 31 '22 at 06:36
  • I updated the answer with the correct syntax. The `tail -n +2` command skips the first line, and then the `split` command splits the rest of the file into partitions. The first partition will not have the header in it (nor will the other partitions), so you don't need `.linesToSkip(1)` in your reader's configuration. – Mahmoud Ben Hassine May 31 '22 at 08:26
  • Thank you!! It perfectly suits the requirement. But there is still a performance gap with this approach as well. – Bheeresh Jun 03 '22 at 09:14