
I have been using Elasticsearch 7.6 and the PHP client API for all operations. I have created the Elasticsearch index settings and mappings as follows:

$params = [
    'index' => $index,
    'body' => [
        'settings' => [
            // Single shard, no replicas; caching, soft deletes and automatic refresh disabled.
            "number_of_shards" => 1,
            "number_of_replicas" => 0,
            "index.queries.cache.enabled" => false,
            "index.soft_deletes.enabled" => false,
            "index.refresh_interval" => -1,
            "index.requests.cache.enable" => false,
            // Raised so a single search can return all 2 million documents.
            "index.max_result_window" => 2000000
        ],
        'mappings' => [
            // _source disabled: hits return only metadata, not the original document.
            '_source' => [
                "enabled" => false
            ],
            'properties' => [
                "text" => [
                    "type" => "text",
                    // Index only doc IDs, no term frequencies or positions.
                    "index_options" => "docs"
                ]
            ]
        ]
    ]
];
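I then create the index roughly like this (a sketch for completeness; I build the client with the default ClientBuilder, which connects to localhost:9200):

require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Default client pointing at localhost:9200.
$client = ClientBuilder::create()->build();

// Create the index with the settings and mappings above.
$client->indices()->create($params);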

My boolean OR search query is as follows:

$json = '{
    "from": 0, "size": 2000000,
    "query": {
        "bool": {
            "filter": {
                "match": {
                    "text": {
                        "query": "apple orange grape banana",
                        "operator": "or"
                    }
                }
            }
        }
    }
}';
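I execute the search roughly like this (a sketch; elasticsearch-php also accepts the raw JSON string as the body):

$results = $client->search([
    'index' => $index,
    'body'  => $json
]);

// With _source disabled, each hit carries only metadata such as _id.
$hits = $results['hits']['hits'];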

I have indexed 2 million documents in such a way that every document matches the query, and I do get all of them back as expected. Since all documents match, I avoid scoring by using a filter clause in the bool query.

But in my log file, I repeatedly get the following message until the query finishes executing. I sometimes get the same message when indexing documents in bulk:

[2020-05-15T19:15:45,720][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][14] overhead, spent [393ms] collecting in the last [1.1s]
[2020-05-15T19:15:47,822][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][16] overhead, spent [399ms] collecting in the last [1s]
[2020-05-15T19:15:49,827][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][18] overhead, spent [308ms] collecting in the last [1s]

I have allocated 16 GB of heap memory. No other warnings appear in the Elasticsearch log. What could be the reason for this, or is it expected when retrieving a huge number of documents? I am aware of the scroll API, but I am curious why this happens when I use a large value for index.max_result_window. Help is much appreciated. Thanks in advance!

1 Answer


What you see is normal behaviour for Elasticsearch with this configuration in particular, and for any Java application in general.

Is it normal for ES with big index.max_result_window?

Yes. As the docs on index.max_result_window state, the amount of garbage generated is proportional to the number of documents returned by the query:

Search requests take heap memory and time proportional to from + size and this limits that memory.

Does it also apply to bulk API requests?

Yes: if your bulk request is large, it might trigger garbage collection as well.

Naturally, ES allocates the documents it needs to send back to the user on the heap; immediately after that they become garbage and are thus subject to garbage collection.
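On the indexing side, a common mitigation is to split the payload into smaller bulk requests, so each one allocates (and frees) less at a time. A minimal sketch following the usual elasticsearch-php bulk pattern ($documents and the 10,000-document chunk size are assumptions):

$params = ['body' => []];

foreach ($documents as $i => $doc) {
    // Each document needs an action/metadata line followed by its source.
    $params['body'][] = ['index' => ['_index' => $index]];
    $params['body'][] = $doc;

    // Flush every 10,000 documents to keep individual requests small.
    if (($i + 1) % 10000 === 0) {
        $client->bulk($params);
        $params = ['body' => []];
    }
}

// Send the remaining documents, if any.
if (!empty($params['body'])) {
    $client->bulk($params);
}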

How does garbage collection work in Java?

You may find some relevant information, for example, here.

Is there a better way to query for all matching documents?

There is: for example, the match_all query.

How is it better than making all documents match a certain query? Elasticsearch does not have to query its indexes and can fetch the documents right away (better performance and resource use).
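For illustration, a minimal match_all sketch with the PHP client (the same from/size limits still apply; $client and $index are assumed from the question):

$results = $client->search([
    'index' => $index,
    'body'  => [
        'from'  => 0,
        'size'  => 2000000,
        // (object)[] encodes as {} rather than [], which is what match_all expects.
        'query' => ['match_all' => (object)[]]
    ]
]);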

Should I use scroll API, or is current approach good enough?

The scroll API is the recommended way, since it scales far beyond the memory capacity of a single Elasticsearch node (one can download 1 TB of data from a cluster of a few machines with some 16 GB of RAM each).
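A minimal scroll sketch with the PHP client (the 1-minute keep-alive and 10,000-document page size are arbitrary assumptions):

// The first request opens the scroll context and returns the first page.
$response = $client->search([
    'scroll' => '1m',
    'index'  => $index,
    'body'   => [
        'size'  => 10000,
        'query' => ['match_all' => (object)[]]
    ]
]);

while (!empty($response['hits']['hits'])) {
    // ... process $response['hits']['hits'] ...

    // Fetch the next page using the scroll ID from the previous response.
    $response = $client->scroll([
        'body' => [
            'scroll_id' => $response['_scroll_id'],
            'scroll'    => '1m'
        ]
    ]);
}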

However, if you still want to use normal search queries, consider using the from and size parameters to paginate (limiting the number of documents fetched per query, and spreading the GC load better over time), as in the sketch below.
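A from/size pagination sketch (keep in mind that from + size still cannot exceed index.max_result_window, and deep pages get progressively more expensive):

$pageSize = 10000; // arbitrary page size
$from     = 0;

do {
    $response = $client->search([
        'index' => $index,
        'body'  => [
            'from'  => $from,
            'size'  => $pageSize,
            'query' => ['match_all' => (object)[]]
        ]
    ]);

    $hits = $response['hits']['hits'];
    // ... process $hits ...

    $from += $pageSize;
} while (count($hits) === $pageSize);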

Hope this helps!

Nikolay Vasiliev