
I have an Elasticsearch 5.2 cluster with 16 nodes (13 data nodes / 3 master nodes, 24 GB RAM and 12 GB heap each). I am performance testing a query, issuing 50 search requests per second against the cluster. The query looks like the following:

{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "cust_id": "AC-90-464690064500"
                    }
                },
                {
                    "range": {
                        "yy_mo_no": {
                            "gt": 201701,
                            "lte": 201710
                        }
                    }
                }
            ]
        }
    }
}

My index mapping looks like the following:

cust_id      keyword
smry_amt     long
yy_mo_no     integer    // doc_values enabled
mkt_id       keyword
. . .
. . .
currency_cd  keyword    // 10 fields in total, 8 of them keyword

The index contains 200 million documents, and each cust_id may have hundreds of documents. The index has 2 replicas. Each document is under 100 bytes.
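
For reference, a minimal sketch of a mapping along these lines (the index and type names billing_summary / summary are hypothetical, and only a few of the 10 fields are shown):

PUT /billing_summary
{
    "mappings": {
        "summary": {
            "properties": {
                "cust_id":     { "type": "keyword" },
                "smry_amt":    { "type": "long" },
                "yy_mo_no":    { "type": "integer", "doc_values": true },
                "mkt_id":      { "type": "keyword" },
                "currency_cd": { "type": "keyword" }
            }
        }
    }
}

(doc_values is already true by default for integer fields; it is spelled out here only because the question calls it out.)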

When I run the performance test for 10 minutes, query response times are very slow. Investigating further in the Kibana Monitoring tab, it appears there is a lot of garbage collection activity (see the image below):

Garbage Collection While Range Search Operation

I have several questions. I did some research on range queries but didn't find much on what can cause GC activity in scenarios like mine. I also researched memory usage and GC activity, but most of the Elasticsearch documentation says that young-generation GC is normal during indexing, while searches mostly use the file system cache maintained by the OS. That is why, in the chart above, heap usage is low: the searches were being served from the file system cache.

So -

  1. What might be causing the garbage collection here?
  2. The chart shows that plenty of heap is still available to Elasticsearch, and used heap is small compared to what is available. So what is triggering GC?
  3. Is the query type causing internal data structures to be created and then discarded, triggering GC?
  4. The CPU spikes may be due to the GC activity.
  5. Is there a more efficient way of running the range query in Elasticsearch versions before 5.5?
  6. Profiling the query shows that Elasticsearch runs a TermQuery and a BooleanQuery, with the latter costing the most (a sketch of the profile request follows this list).
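
For reference on point 6, the timings came from the search Profile API; a minimal sketch of such a request (assuming a hypothetical index name billing_summary) is:

GET /billing_summary/_search
{
    "profile": true,
    "query": {
        "bool": {
            "must": [
                { "term": { "cust_id": "AC-90-464690064500" } },
                { "range": { "yy_mo_no": { "gt": 201701, "lte": 201710 } } }
            ]
        }
    }
}

The response includes a per-shard breakdown of the Lucene queries (TermQuery, BooleanQuery, etc.) and their timings.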

Any idea what's going on here?

Thanks in advance,

  • SGSI.
  • I don't think the problem is GC related; according to your charts you had 4 collections during 1 minute with a total duration of about 100 ms. Could you please provide your I/O stats (disk reads and writes)? – Ivan Mamontov Oct 23 '17 at 21:12

1 Answer


The correct answer depends on your index settings, but I guess you are using the integer type with doc_values enabled. That data structure is designed to support aggregations and sorting, not range queries. The right data type here is a range type (integer_range).

With doc values, Elasticsearch/Lucene iterates over ALL documents (i.e. a full scan) in order to match the range query. This requires reading and decoding every value from the doc-values column, which is quite expensive, especially when the index cannot be cached by the operating system.
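
As a rough sketch of that suggestion (assuming Elasticsearch 5.2+, where range datatypes are available, and using hypothetical index/field names), each monthly record could store its month as a degenerate integer_range and be queried with a range query:

PUT /billing_summary_ranges
{
    "mappings": {
        "summary": {
            "properties": {
                "cust_id":     { "type": "keyword" },
                "yy_mo_range": { "type": "integer_range" }
            }
        }
    }
}

PUT /billing_summary_ranges/summary/1
{
    "cust_id": "AC-90-464690064500",
    "yy_mo_range": { "gte": 201705, "lte": 201705 }
}

GET /billing_summary_ranges/_search
{
    "query": {
        "bool": {
            "must": [
                { "term": { "cust_id": "AC-90-464690064500" } },
                {
                    "range": {
                        "yy_mo_range": {
                            "gt": 201701,
                            "lte": 201710,
                            "relation": "within"
                        }
                    }
                }
            ]
        }
    }
}

Range fields are resolved from the index structures rather than by scanning doc values, and the relation parameter (intersects by default) controls how each stored range is compared to the query range.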

Ivan Mamontov
  • Thanks for the reply, Ivan. There were a couple of other issues and I had to focus on another high-priority one, hence the delay in responding. Apologies. The data in the index is month-wise summarized billing info. A couple of questions here: (1) How can I use the range datatype to store month-wise data? (2) Now that search performance is twice as good after doubling the number of shards (and reindexing), should I disable doc values? Is that expected to further improve the search rate? – sgsi Nov 02 '17 at 18:00
  • After doubling the number of shards, the GC activity reduced, CPU usage halved, and the search rate nearly doubled. I don't know what the relation is between the number of shards (13 shards on 13 data nodes earlier vs. 26 shards on 13 data nodes now) and GC activity, but that's how the search rate improved. – sgsi Nov 02 '17 at 18:03
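
For what it's worth, the shard doubling described in the last comment would typically be done by creating a new index with more primary shards and reindexing into it; a rough sketch (index names and settings are hypothetical, and the mapping would also need to be applied to the new index, since _reindex does not copy it):

PUT /billing_summary_v2
{
    "settings": {
        "number_of_shards": 26,
        "number_of_replicas": 2
    }
}

POST /_reindex
{
    "source": { "index": "billing_summary" },
    "dest":   { "index": "billing_summary_v2" }
}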