
We need to read data from ES 1.7 and index it into 6.7, since no direct upgrade is available: almost 5 TB of data, about 200 million records. We are using the REST high-level client (6.7.2) with the search-and-scroll approach, but we are not able to scroll using the scroll id. We also tried paging with from and a batch size; initially the read is fast, but as the from offset increases the read gets really bad. What is the best approach?

1st approach: search and scroll.

    SearchRequest searchRequest = new SearchRequest("indexname");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(10);
    searchRequest.source(searchSourceBuilder);
    searchRequest.scroll(TimeValue.timeValueMinutes(2));
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    String scrollId = searchResponse.getScrollId();

    boolean run = true;
    while (run) {
        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
        scrollRequest.scroll(TimeValue.timeValueSeconds(60));
        SearchResponse searchScrollResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
        scrollId = searchScrollResponse.getScrollId();
        SearchHits hits = searchScrollResponse.getHits();

        if (hits.getHits().length == 0) {
            run = false;
        }
    }

Exception:

    Exception in thread "main" ElasticsearchStatusException[Elasticsearch exception [type=exception, reason=ElasticsearchIllegalArgumentException[Failed to decode scrollId]; nested: IOException[Bad Base64 input character decimal 123 in array position 0]; ]]
        at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
        at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2050)
        at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2026)
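The "Bad Base64 input character decimal 123 in array position 0" hints at the root cause: decimal 123 is `{`, i.e. the 6.x high-level client wraps the scroll id in a JSON body (`{"scroll_id": ...}`), while an ES 1.7 server expects the raw scroll id as the request body. A minimal sketch of a workaround, using only the JDK so it is independent of client versions; the host name is a placeholder and this is an assumption about the 1.x wire format, not tested against a live cluster:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LegacyScroll {

    // ES 1.x accepts the raw scroll id as the body of /_search/scroll,
    // not the JSON wrapper that the 6.x high-level client sends.
    static String scrollUrl(String host, String keepAlive) {
        return host + "/_search/scroll?scroll=" + keepAlive;
    }

    // Continue a scroll against a 1.x cluster by POSTing the raw id.
    static String continueScroll(String host, String scrollId) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(scrollUrl(host, "2m")).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(scrollId.getBytes(StandardCharsets.UTF_8)); // raw id, no JSON
        }
        return new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
    }
}
```

You could keep the high-level client for the initial search and fall back to a raw HTTP call (or the low-level RestClient) only for the scroll continuation.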

2nd approach: paging with from and size.

    int offset = 0;
    boolean run = true;
    while (run) {
        SearchRequest searchRequest = new SearchRequest("indexname");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.from(offset);
        searchSourceBuilder.size(500);
        searchRequest.source(searchSourceBuilder);
        long start = System.currentTimeMillis();
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        long end = System.currentTimeMillis();

        SearchHits hits = searchResponse.getHits();
        System.out.println("Total hits : " + hits.getTotalHits() + " time : " + (end - start));
        offset += 500;
        if (hits.getHits().length == 0) {
            run = false;
        }
    }

Is there any other approach to read this data?

Machavity
Akshay
  • What about 1) slicing you data logically (by date for instance) then 2) using logstash with elasticsearch input and elasticsearch output? – ugosan May 08 '19 at 20:59
  • The read will still be costly; the index has grown very big, and slicing it is a bit difficult since we are on ES 1.7. – Akshay May 09 '19 at 06:34
  • my suggestion about slicing has nothing to do with the version, you would do multiple parallel queries, each with their own date range. – ugosan May 09 '19 at 14:05
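The slicing suggestion above works against 1.7 too, because it needs nothing but range filters: split the overall time window into N sub-ranges and run one scroll (or query) per sub-range in parallel. A minimal sketch of the splitting step; the date field and bounds are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

public class DateSlicer {

    // One [from, to) interval per worker; each worker runs its own
    // scroll restricted by a range filter on a date field.
    static List<long[]> splitRange(long fromMillis, long toMillis, int slices) {
        List<long[]> ranges = new ArrayList<>();
        long span = (toMillis - fromMillis) / slices;
        for (int i = 0; i < slices; i++) {
            long start = fromMillis + i * span;
            long end = (i == slices - 1) ? toMillis : start + span;
            ranges.add(new long[]{start, end});
        }
        return ranges;
    }
}
```

Each slice then becomes something like `QueryBuilders.rangeQuery("timestamp").gte(start).lt(end)` on its own thread, so no single scroll has to walk all 200 million documents.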

1 Answer


Generally the best solution would be a remote reindex: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/docs-reindex.html#reindex-from-remote

I'm not sure the REST clients are still compatible with 1.x, but remote reindex should support it.

Deep pagination is very expensive, which is why it should be avoided; your example shows exactly why.
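Remote reindex is driven by a single POST to `_reindex` on the 6.7 cluster, after whitelisting the old host via `reindex.remote.whitelist` in `elasticsearch.yml`. A sketch of building the request body; the host and index names are placeholders:

```java
public class RemoteReindex {

    // Body for POST /_reindex on the 6.7 cluster. The 1.7 host must be
    // whitelisted via reindex.remote.whitelist on the 6.7 side.
    // "size" inside "source" is the per-batch document count.
    static String reindexBody(String remoteHost, String sourceIndex, String destIndex) {
        return "{"
                + "\"source\": {"
                +   "\"remote\": {\"host\": \"" + remoteHost + "\"},"
                +   "\"index\": \"" + sourceIndex + "\","
                +   "\"size\": 1000"
                + "},"
                + "\"dest\": {\"index\": \"" + destIndex + "\"}"
                + "}";
    }
}
```

The body can be sent with the low-level RestClient (`new Request("POST", "/_reindex")` plus `setJsonEntity(...)`), and a query can be added under `"source"` to filter, or a `"script"` section to modify documents in flight.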

xeraa
  • It's not working because the index version is old :) and we want to filter and update the docs. – Akshay May 09 '19 at 06:39
  • Remote reindex from 1.7 to 6.x should IMO still work. It supports filtering, and you can update your docs with a script. Changing the type to _doc would be something I'd clean up right now to make the upgrade to 7 simpler. – xeraa May 09 '19 at 09:42