I am using Elasticsearch with the Java API.

I am indexing offline data with large bulk inserts, so I set index.refresh_interval to -1.

I don't refresh the index "manually" anywhere.

It seems that refresh is done at some point, because queries do return data. The only scenario where the data wasn't returned was when I tested with just a few documents, and querying was done immediately after insertion (using the same Client object).

Is an index refresh triggered implicitly by Elasticsearch or by the Java library at some stage, even when index.refresh_interval is -1?

Or how else could the behavior be explained?

Client generation:

Client client = TransportClient.builder().settings(settings)
        .build()
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(address),port));

Insertion:

BulkRequestBuilder bulkRequest = client.prepareBulk();

for (MyObject object : list) {
    bulkRequest.add(client.prepareIndex(index, type)
            .setSource(XContentFactory.jsonBuilder()
                    .startObject()
                    // ... add object fields here ...
                    .endObject()
            ));
}

BulkResponse bulkResponse = bulkRequest.get();

Querying:

   QueryBuilder query = ...;

   SearchResponse resp = client.prepareSearch(index)
            .setQuery(query)
            .setSize(Integer.MAX_VALUE)
            // adding fields here 
            .get();

   SearchHit[] hits = resp.getHits().getHits();
daphshez

1 Answer

The documents were probably searchable despite the refresh interval being disabled for one of two reasons: either the index buffer filled up, which causes a new Lucene segment to be created, or the translog filled up, which causes a Lucene commit (a flush). Either event makes the documents searchable.

As per the documentation:

"By default, Elasticsearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory."
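
To make the effect concrete, here is a toy model in plain Java. It is not Elasticsearch code, and the class name ToyIndex, the per-document buffer limit, and the method names are all made up for illustration. It only mimics the idea above: documents sit in a buffer and become searchable once the buffer fills and is written out, with no refresh timer involved.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Elasticsearch code. All names and limits are hypothetical.
// It mimics one idea from the answer: buffered documents become searchable
// once the buffer fills up and is written out as a segment.
class ToyIndex {
    private final int bufferCapacity;                           // hypothetical limit, counted in documents
    private final List<String> buffer = new ArrayList<>();      // indexed but not yet searchable
    private final List<String> searchable = new ArrayList<>();  // documents already in "segments"

    ToyIndex(int bufferCapacity) {
        this.bufferCapacity = bufferCapacity;
    }

    // Adds a document; a full buffer is written out immediately.
    void index(String doc) {
        buffer.add(doc);
        if (buffer.size() >= bufferCapacity) {
            writeSegment(); // triggered by buffer pressure, not by a refresh timer
        }
    }

    // Moves every buffered document into the searchable set.
    void writeSegment() {
        searchable.addAll(buffer);
        buffer.clear();
    }

    int searchableCount() {
        return searchable.size();
    }

    public static void main(String[] args) {
        ToyIndex small = new ToyIndex(1000);
        for (int i = 0; i < 3; i++) small.index("doc" + i);
        System.out.println(small.searchableCount()); // 0: the buffer never filled

        ToyIndex big = new ToyIndex(1000);
        for (int i = 0; i < 2500; i++) big.index("doc" + i);
        System.out.println(big.searchableCount()); // 2000: two full buffers were written out
    }
}
```

This matches the behavior described in the question: a handful of documents queried right away stays invisible, while a large bulk load becomes at least partly searchable without any explicit refresh.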

The index buffer settings can also be tuned, for example via the node-level indices.memory.index_buffer_size setting.

This article is a good read with regard to how data is searchable and durable.

You can also look at this SO thread, written by one of the Elasticsearch contributors, for more details on the difference between flush and refresh.

You can use the indices stats API to verify all this, i.e. to check whether a flush or a refresh has taken place.

Example:

 GET <index_name>/_stats/refresh

 GET <index_name>/_stats/flush
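
If you would rather run the same check from Java, a minimal sketch using only the JDK could look like the following. It assumes the HTTP endpoint is reachable on localhost:9200 and uses a placeholder index name, my_index; the class and method names are made up for this example. Both metrics are requested in one call by joining them with a comma.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch only: class and method names are hypothetical, not part of any Elasticsearch library.
class StatsCheck {
    // Builds the URL for the refresh and flush stats of a single index.
    static String statsUrl(String host, int port, String index) {
        return "http://" + host + ":" + port + "/" + index + "/_stats/refresh,flush";
    }

    // Performs a plain HTTP GET and returns the response body as text.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: localhost:9200 and an index named "my_index".
        String index = args.length > 0 ? args[0] : "my_index";
        System.out.println(fetch(statsUrl("localhost", 9200, index)));
    }
}
```

Run it once before the bulk insert and once after, then compare the "total" counters under "refresh" and "flush" in the JSON output. If either counter went up, that operation happened in between.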
keety