I am reading a CSV file line by line. It contains more than 7 million lines and takes up more than 1 GB on disk.

Reading the lines into a List<String> works fine and takes less than 2 minutes. The problem comes when I loop over this list and map each line to a Balance object that I created: I get an OutOfMemoryError:

01:00:30.664 [restartedMain] ERROR org.springframework.batch.core.step.AbstractStep - Encountered an error executing step readInputStep in job readCsvJob
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68) ~[?:1.8.0_172]
    at java.lang.StringBuffer.<init>(StringBuffer.java:128) ~[?:1.8.0_172]
    at java.text.DigitList.getStringBuffer(DigitList.java:804) ~[?:1.8.0_172]
    at java.text.DigitList.getDouble(DigitList.java:164) ~[?:1.8.0_172]
    at java.text.DecimalFormat.parse(DecimalFormat.java:2089) ~[?:1.8.0_172]
    at java.text.NumberFormat.parse(NumberFormat.java:383) ~[?:1.8.0_172]
    at fr.payet.flad.batch.mapper.BalanceLineMapper.parseToDouble(BalanceLineMapper.java:56) ~[classes/:?]
    at fr.payet.flad.batch.mapper.BalanceLineMapper.toBalance(BalanceLineMapper.java:40) ~[classes/:?]
    at fr.payet.flad.batch.tasklet.ReadInputTasklet.execute(ReadInputTasklet.java:56) ~[classes/:?]

Here is my BalanceLineMapper code:

@Component
@Slf4j
public class BalanceLineMapper {

    public Balance toBalance(String[] ligneCsv, int cursorIndex) {
        try {
            return Balance.builder()
                    .index(cursorIndex)
                    .exer(ligneCsv[0])
                    .ident(ligneCsv[1])
                    .nDept(ligneCsv[2])
                    .lBudg(ligneCsv[3])
                    .insee(ligneCsv[4])
                    .siren(ligneCsv[5])
                    .cRegi(ligneCsv[6])
                    .nomen(ligneCsv[7])
                    .cType(ligneCsv[8])
                    .cstyp(ligneCsv[9])
                    .cActi(ligneCsv[10])
                    .finess(ligneCsv[11])
                    .secteur(ligneCsv[12])
                    .cBudg(ligneCsv[13])
                    .codBud1(ligneCsv[14])
                    .compte(ligneCsv[15])
                    .BEDeb(ligneCsv[16])
                    .BECre(parseToDouble(ligneCsv[17]))
                    .OBNetDeb(parseToDouble(ligneCsv[18]))
                    .OBNetCre(parseToDouble(ligneCsv[19]))
                    .ONBDeb(parseToDouble(ligneCsv[20]))
                    .ONBCre(parseToDouble(ligneCsv[21]))
                    .OOBDeb(parseToDouble(ligneCsv[22]))
                    .OOBCre(parseToDouble(ligneCsv[23]))
                    .sd(parseToDouble(ligneCsv[24]))
                    .sc(parseToDouble(ligneCsv[25]))
                    .build();
        } catch (NumberFormatException e) {
            log.debug("Erreur lors de du casting");
        }
        return null;
    }

    private Double parseToDouble(String number){
        NumberFormat format = NumberFormat.getInstance(Locale.FRANCE);
        try {
             return format.parse(number).doubleValue();
        }catch (ParseException e){
            log.error("Erreur de parsing de {} en Java Double", number, e.getMessage(), e);
        }
        log.error("parseToDouble retourne la valeur NULL");
        return null;
    }

}

And the ReadInputTasklet code:

@Slf4j
@Component
public class ReadInputTasklet implements Tasklet, StepExecutionListener {

    @Autowired
    BalanceLineMapper balanceLineMapper;

    @Override
    public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext) throws Exception {
        List<Balance> balances = Lists.newArrayList();
        List<String> balancesList = Lists.newArrayList();
        try {
            CSVReader reader = new CSVReader(new FileReader("/Users/ghassen/Desktop/FLAD/Balance_Commune_2016.csv"), '\n');
            String[] nextLine;
            int cursorIndex = 0;
            while ((nextLine = reader.readNext()) != null) {
                if (cursorIndex != 0){
                    balancesList.add(nextLine[0]);
                    log.debug("{} balance(s) ajoutée(s) dans la liste ...", balancesList.size());
                }
                cursorIndex++;
            }
            log.debug("Lecture de toutes les lignes terminé");

            log.debug("Parsing de toutes les lignes");
            for (String line : balancesList){
                String[] lineSeperated = StringUtils.splitByWholeSeparatorPreserveAllTokens(line,";");
                balances.add(balanceLineMapper.toBalance(lineSeperated, cursorIndex));
            }
            log.debug("Job terminé");
        } catch (IOException e) {
            log.error("File not found", e);
        }
        return RepeatStatus.FINISHED;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {

    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return null;
    }
}
Ghassen
  • I do not know Spring, but you may have to increase its heap settings. -Xmx would be the related command line argument for a vanilla JVM, e.g. -Xmx6G would set the upper limit to 6 gigabytes. Perhaps this way: https://stackoverflow.com/questions/23072187/how-to-configure-heap-size-when-starting-a-spring-boot-application-with-embedded – tevemadar Jul 05 '18 at 23:28

2 Answers

You are creating a huge number of objects (including the strings that you parse later) in a short time, faster than the garbage collector can keep up, and both lists keep every one of them reachable. I recommend restructuring the whole thing as a stream: read, parse, and hand off each line as you go, and only parse the lines you will actually need.
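
As a rough illustration of the streaming idea, here is a minimal sketch in plain Java that reads the file line by line and hands the rows to a callback in small batches, so that only one batch is held in memory at a time. The class name `ChunkedCsvReader`, the method `processInChunks`, the UTF-8 charset, and the naive `;` split (no quoted fields) are all assumptions made for the example, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: streams a ';'-separated file in fixed-size batches.
public class ChunkedCsvReader {

    /**
     * Reads the file line by line, skips the first (header) line, and passes
     * the split rows to the handler in batches of at most chunkSize rows.
     * Returns the number of data lines read.
     */
    public static long processInChunks(Path csv, int chunkSize,
                                       Consumer<List<String[]>> handler) throws IOException {
        long count = 0;
        List<String[]> chunk = new ArrayList<>(chunkSize);
        // Assumption: the file is UTF-8 encoded; adjust the charset if not.
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line = reader.readLine(); // header line, ignored
            while ((line = reader.readLine()) != null) {
                // Limit -1 keeps trailing empty columns.
                chunk.add(line.split(";", -1));
                count++;
                if (chunk.size() == chunkSize) {
                    handler.accept(chunk);
                    // Start a fresh list so the previous batch can be collected.
                    chunk = new ArrayList<>(chunkSize);
                }
            }
        }
        if (!chunk.isEmpty()) {
            handler.accept(chunk); // last, partially filled batch
        }
        return count;
    }
}
```

The handler would be the place to map each row to a Balance and persist the batch, after which nothing keeps a reference to it. Spring Batch also offers chunk-oriented steps (for example with FlatFileItemReader) that follow the same read, process, write pattern, which may fit better than a Tasklet here.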

AUser
  • My previous solution was to parse each line as soon as it was read and add the result to a `List`. And I need all of them, because I will persist them in the database afterwards. – Ghassen Jul 05 '18 at 23:27

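I agree with @AUser. However, let me be more specific. You can replace your parseToDouble function with the standard Double.valueOf(), which should be much more efficient because it does not create a new NumberFormat on every call. Note that Double.valueOf() expects a dot as the decimal separator, so values written with a comma (as Locale.FRANCE implies) have to be normalized first.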
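
A minimal sketch of what that replacement could look like, assuming the amounts in the file use a comma as the decimal separator and contain no grouping separators. The class name `FrenchDecimalParser` is hypothetical, and returning null for blank or invalid input simply mirrors what the original parseToDouble does.

```java
// Hypothetical replacement for parseToDouble, based on Double.valueOf().
public class FrenchDecimalParser {

    /**
     * Parses values such as "1234,56" or "-3".
     * Returns null for null, blank, or unparseable input.
     * Assumption: no grouping separators (spaces) inside the number.
     */
    public static Double parse(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim();
        if (s.isEmpty()) {
            return null;
        }
        try {
            // Double.valueOf only understands '.', so normalize the comma first.
            return Double.valueOf(s.replace(',', '.'));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

If the file does contain grouping separators, another option is to keep NumberFormat but create it once per mapper instead of once per call, bearing in mind that NumberFormat instances are not thread-safe.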

David Medinets