
I need to fetch more than 1,000,000 records from Elasticsearch using Java RestHighLevelClient

I am using scroll for pagination and everything is working fine.

The code looks something like this:

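import java.io.IOException;

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
// In newer 7.x clients TimeValue lives in org.elasticsearch.core instead.
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.Scroll;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

class ScrollTest {

    // LocalhostClient is my own helper that builds a client for localhost.
    static final RestHighLevelClient client = LocalhostClient.create();

    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();

        // Match every document in the index.
        QueryBuilder matchQueryBuilder = QueryBuilders.boolQuery().must(new MatchAllQueryBuilder());

        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(matchQueryBuilder);
        searchSourceBuilder.size(10000); // max is 10000

        // The scroll context is kept alive for 5 seconds between requests.
        final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(5L));

        SearchRequest searchRequest = new SearchRequest("movies_data");
        searchRequest.source(searchSourceBuilder);
        searchRequest.scroll(scroll);

        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = searchResponse.getScrollId();
        SearchHit[] data = new SearchHit[0]; // accumulated results
        SearchHit[] searchHits = searchResponse.getHits().getHits();

        while (searchHits != null && searchHits.length > 0) {
            // function for concatenating results into data goes here

            // Ask for the next page of the same scroll.
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);

            scrollId = searchResponse.getScrollId();
            searchHits = searchResponse.getHits().getHits();
            System.out.println("###################" + searchHits.length);
        }

        // Release the scroll context on the server.
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);

        System.out.println("Time taken (ms): " + (System.currentTimeMillis() - start));
        client.close();
    }
}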

But fetching a large number of documents using scroll takes a lot of time: each request returns at most 10,000 documents, so I end up making 100 sequential requests. Overriding the default window size did not help much either (I tried 100, 1000, 10000 and 25000).

I want to make the fetching parallel, at least within a page, using sliced scroll, but I cannot work out how to combine slicing with scroll.
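To make the goal concrete, here is a minimal sketch of the threading pattern I have in mind, with Elasticsearch replaced by a toy stand-in so it runs on its own. The names `SlicedScrollSketch`, `SliceFetcher`, `fetchAll` and the toy fetcher are hypothetical, made up for illustration. My untested assumption from reading the docs is that a real fetcher would call `searchSourceBuilder.slice(new SliceBuilder(sliceId, maxSlices))` before the first search and then run the same scroll loop as above for that slice only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SlicedScrollSketch {

    /** Hypothetical hook: runs one complete scroll loop for a single slice. */
    interface SliceFetcher<T> {
        List<T> fetchSlice(int sliceId, int maxSlices) throws Exception;
    }

    /** Runs one task per slice in parallel and concatenates the results. */
    static <T> List<T> fetchAll(int maxSlices, SliceFetcher<T> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(maxSlices);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int id = 0; id < maxSlices; id++) {
                final int sliceId = id;
                Callable<List<T>> task = () -> fetcher.fetchSlice(sliceId, maxSlices);
                futures.add(pool.submit(task));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                all.addAll(future.get()); // blocks until that slice has finished
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for slices", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A slice failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy stand-in for Elasticsearch (assumption, not the real partitioning):
        // slice i "owns" the document ids with id % maxSlices == i.
        SliceFetcher<Integer> toyFetcher = (sliceId, maxSlices) -> {
            List<Integer> ids = new ArrayList<>();
            for (int doc = sliceId; doc < 100; doc += maxSlices) {
                ids.add(doc);
            }
            return ids;
        };
        List<Integer> all = fetchAll(4, toyFetcher);
        System.out.println("Fetched " + all.size() + " documents from 4 slices");
    }
}
```

With this shape, each slice would keep its own scroll ID and need its own clear-scroll call, and the 100 sequential round trips would be split across the slices instead of running one after another.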

Can someone please explain how to use SliceBuilder to achieve this parallelism?

sam