
I need to process data from a REST web service. Here is a basic example:

import org.springframework.batch.item.ItemReader;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RESTDataReader implements ItemReader<DataDTO> {

    private final String apiUrl;
    private final RestTemplate restTemplate;

    // Position of the next item to return from the fetched list
    private int nextDataIndex;
    // Whole response held in memory; null until the first read()
    private List<DataDTO> data;

    RESTDataReader(String apiUrl, RestTemplate restTemplate) {
        this.apiUrl = apiUrl;
        this.restTemplate = restTemplate;
        nextDataIndex = 0;
    }

    @Override
    public DataDTO read() throws Exception {
        // Lazily fetch everything on the first call
        if (dataIsNotInitialized()) {
            data = fetchDataFromAPI();
        }

        DataDTO nextData = null;

        if (nextDataIndex < data.size()) {
            nextData = data.get(nextDataIndex);
            nextDataIndex++;
        } else {
            // End of data: reset state and return null to signal the end of input
            nextDataIndex = 0;
            data = null;
        }

        return nextData;
    }

    private boolean dataIsNotInitialized() {
        return this.data == null;
    }

    private List<DataDTO> fetchDataFromAPI() {
        ResponseEntity<DataDTO[]> response = restTemplate.getForEntity(apiUrl, DataDTO[].class);
        DataDTO[] body = response.getBody();
        // Guard against an empty response body, which would otherwise cause a NullPointerException
        return body == null ? Collections.<DataDTO>emptyList() : Arrays.asList(body);
    }
}

However, my fetchDataFromAPI method is called with a date range, and it can return more than 20 million objects.

For example: if I call it between 01012020 and 01012021, I'll get 80 million objects.

PS: the web service paginates by single day, i.e. if I want to retrieve the data between 01/09/2020 and 07/09/2020, I have to call it several times (between 01/09-02/09, then between 02/09-03/09, and so on until 06/09-07/09).

My problem in this case is running out of heap space when the data is large.

To avoid this problem, I had to create one step per month in my BatchConfiguration (12 steps). The first step calls the web service between 01/01/2020 and 01/02/2020, and so on.

Is there a way to read this whole volume of data in a single step before going to the processor?

Thanks in advance

BOUTERBIAT Oualid
  • `the web service works by pagination of a single day`: Does this web service provide pagination within the same day, i.e. at a granularity smaller than a day? For example, for a single day, is it possible to query data in pages of, say, 100 or 1000 items at a time? – Mahmoud Ben Hassine Apr 16 '21 at 07:49
  • No, it's only by day, and I have to specify the first and second day in the parameters. – BOUTERBIAT Oualid Apr 16 '21 at 07:51

1 Answer

Since your web service does not provide pagination within a single day, you need to ensure that the process that calls this web service (i.e. your Spring Batch job) has enough memory to store all items returned by a single call to this service.

"For example: if I call it between 01012020 and 01012021, I'll get 80 million objects."

This means that if you call this web service with curl on a machine that does not have enough memory to hold the result, then the curl command will fail. The point I want to make here is that the only way to solve this issue is to give enough memory to the JVM that runs your Spring Batch job to hold such a big result set.

As a side note: if you have control over this web service, I highly recommend improving it by introducing a more granular pagination mechanism.
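To illustrate how a single step could walk through the date range one day at a time, keeping only the current day's items in memory, here is a minimal sketch. It is not tied to Spring so that it stays self-contained: `DayWindowReader` and the `fetcher` function are hypothetical names, and `read()` only mirrors the `ItemReader.read()` contract (return `null` when there is nothing left). In a real job the class would implement `ItemReader<DataDTO>` and the fetcher would wrap the `restTemplate.getForEntity` call for one day. The end date is assumed to be exclusive, matching the 01/09-02/09 ... 06/09-07/09 calls described in the question.

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch: reads a date range day by day, holding one day of items at a time.
class DayWindowReader<T> {

    private final LocalDate endExclusive;
    // Assumed to call the web service for the window [from, to) and return that day's items
    private final BiFunction<LocalDate, LocalDate, List<T>> fetcher;

    private LocalDate nextDay;
    private Iterator<T> currentDay = Collections.emptyIterator();

    DayWindowReader(LocalDate start, LocalDate endExclusive,
                    BiFunction<LocalDate, LocalDate, List<T>> fetcher) {
        this.nextDay = start;
        this.endExclusive = endExclusive;
        this.fetcher = fetcher;
    }

    // Returns the next item, or null once every day in the range has been consumed
    public T read() {
        while (!currentDay.hasNext()) {
            if (!nextDay.isBefore(endExclusive)) {
                return null; // no more days to fetch
            }
            // Replacing the iterator drops the reference to the previous day's list,
            // so those items become eligible for garbage collection
            List<T> items = fetcher.apply(nextDay, nextDay.plusDays(1));
            currentDay = (items == null) ? Collections.<T>emptyIterator() : items.iterator();
            nextDay = nextDay.plusDays(1);
        }
        return currentDay.next();
    }
}
```

With this shape, the heap only needs to hold the largest single day rather than a whole month or year, which is the limit imposed by the web service itself. The sketch does not save its position, so a restarted job would start again from the first day.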

Mahmoud Ben Hassine
  • Thanks Mahmoud for your response. Just to clarify, because I wasn't clear enough: let's say we want to get data between 01012020 and 07012020. I have to call the web service 6 times, once for each day. It's the only way to call this external web service. – BOUTERBIAT Oualid Apr 16 '21 at 08:26
  • That was clear enough to me already, thank you. But that's the next step of the problem, i.e. how to discard the items of a given day from memory and proceed with the next day (this is not really an issue, we can easily make those items eligible for garbage collection by the JVM). The main problem is that you can't paginate within a single day, so you need to accept the fact that you can get more data than your allocated memory and prepare for that. – Mahmoud Ben Hassine Apr 16 '21 at 08:34
  • Exactly, and that's why I used multiple steps (one step per month) to make sure the memory can hold the data, but I don't like this solution. Anyway, thank you for your response. – BOUTERBIAT Oualid Apr 16 '21 at 08:37
  • OK, you are welcome. `i don't like this solution`: no problem, but I'm genuinely curious whether there is a solution to this problem other than providing more memory to the process that calls the web service. – Mahmoud Ben Hassine Apr 16 '21 at 08:47
  • Exactly, I'm curious too, and I was sure there was a solution for this case, but 🤷‍♂️ some developers told me to work with Spark instead for big data sets ... – BOUTERBIAT Oualid Apr 16 '21 at 08:50