I'm new to Elasticsearch. My queries are slow when I use should clauses matching multiple search terms, and also when matching nested documents: the first query takes 7-10 seconds and later ones 5-6 seconds thanks to the Elasticsearch cache. Queries on non-nested fields with a plain match are fast, i.e. within 100 ms.

I'm running Elasticsearch on an AWS instance with 250 GB RAM and 500 GB disk space. I have one template and 204 indices holding around 107 million documents in total, with 2 shards per index on a single node, and I have set the heap size to 30 GB.

Following is my memory usage: [image: memory]

A document can have more than 50k nested objects, so I have increased the limit to 500k. Searching on these nested objects takes too much time, and any OR (should match) operation on non-nested fields is also slow. Is there any way to boost query performance for nested objects? Is there anything wrong in my configuration? And is there any way to make the first query faster too?

{
  "index_patterns": [
    "product_*"
  ],
  "template": {
    "settings": {
      "index.store.type": "mmapfs",
      "number_of_shards":2,
      "number_of_replicas": 0,
      "index": {
        "store.preload": [
          "*"
        ],
        "mapping.nested_objects.limit": 500000,
        "analysis": {
          "analyzer": {
            "cust_product_name": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "english_stop",
                "name_wordforms",
                "business_wordforms",
                "english_stemmer",
                "min_value"
              ],
              "char_filter": [
                "html_strip"
              ]
            },
            "entity_name": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "english_stop",
                "business_wordforms",
                "name_wordforms",
                "english_stemmer"
              ],
              "char_filter": [
                "html_strip"
              ]
            },
            "cust_text": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "english_stop",
                "name_wordforms",
                "english_stemmer",
                "min_value"
              ],
              "char_filter": [
                "html_strip"
              ]
            }
          },
          "filter": {
            "min_value": {
              "type": "length",
              "min": 2
            },
            "english_stop": {
              "type": "stop",
              "stopwords": "_english_"
            },
            "business_wordforms": {
              "type": "synonym",
              "synonyms_path": "<some path>/business_wordforms.txt"
            },
            "name_wordforms": {
              "type": "synonym",
              "synonyms_path": "<some path>/name_wordforms.txt"
            },
            "english_stemmer": {
              "type": "stemmer",
              "language": "english"
            }
          }
        }
      }
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "product_number": {
          "type": "text",
          "analyzer": "keyword"
        },
        "product_name": {
          "type": "text",
          "analyzer": "cust_product_name"
        },
        "first_fetch_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
        },
        "last_fetch_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
        },
        "review": {
          "type": "nested",
          "properties": {
            "text": {
              "type": "text",
              "analyzer": "cust_text"
            },
            "review_date": {
              "type": "date",
              "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
            }
          }
        }
      }
    },
    "aliases": {
      "all_products": {}
    }
  },
  "priority": 200,
  "version": 1
}

If I search for any specific term in the review text, the response takes too long.

{
    "_source":{
        "excludes":["review"]
    },
    "size":1,
    "track_total_hits":true,
    "query":{
        "nested":{
            "path":"review",
            "query":{
                "match":{
                    "review.text":{
                        "query":"good",
                        "zero_terms_query":"none"
                    }
                }
            }
        }
    },
    "highlight":{
        "pre_tags":[
            "<b>"
        ],
        "post_tags":[
            "</b>"
        ],
        "fields":{
            "product_name":{
                
            }
        }
    }
}

I'm sure I'm missing something obvious.

Code Wizard

1 Answer

Easy things first: track_total_hits should be set to false. A maintenance force merge could also help.
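
As a rough sketch of those two suggestions, the Python below builds the request body for the nested query from the question with a configurable track_total_hits. The helper name review_search_body is hypothetical and nothing is sent to a cluster. Passing an integer instead of false is an alternative not mentioned in the question: the count is then only guaranteed accurate up to that bound, which may be enough for a "10,000+ results" label. The force merge line assumes the product_* indices no longer receive writes.

```python
import json


def review_search_body(term, track_total_hits=False):
    """Build the nested review query from the question (hypothetical helper)."""
    return {
        "_source": {"excludes": ["review"]},
        "size": 1,
        # False skips exact hit counting; an integer counts accurately
        # only up to that bound and reports a lower bound beyond it.
        "track_total_hits": track_total_hits,
        "query": {
            "nested": {
                "path": "review",
                "query": {"match": {"review.text": {"query": term}}},
            }
        },
    }


# Force merge is a separate maintenance request; shown here as a console line.
# Assumption: the indices are no longer being written to.
FORCE_MERGE_REQUEST = "POST /product_*/_forcemerge?max_num_segments=1"

if __name__ == "__main__":
    print(json.dumps(review_search_body("good"), indent=2))
    print(json.dumps(review_search_body("good", track_total_hits=10000), indent=2))
    print(FORCE_MERGE_REQUEST)
```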

The difference between the first and subsequent request times is due to the Elasticsearch cache.

But if I understand correctly, you can have more than 50k reviews on a single document? If so, that is too many. Could you consider inverting your mapping, with a reviews index that embeds the related product as an object? It should be much faster.

PUT reviews 
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "review_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
      },
      "product": {
        "properties": {
          "product_number": {
            "type": "text",
            "analyzer": "keyword"
          },
          "product_name": {
            "type": "text"
          },
          "first_fetch_date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
          },
          "last_fetch_date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
          }
        }
      }
    }
  }
}
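
To illustrate what inverting the mapping means for the data, here is a minimal Python sketch that turns one product document with nested reviews into one document per review, shaped like the reviews mapping above. The function name flatten_product and the sample values are made up for illustration; a real reindex would stream the resulting documents through the bulk API.

```python
def flatten_product(product):
    """Turn one product document with nested reviews into one document per review."""
    # Copy every product-level field except the nested review array.
    product_fields = {k: v for k, v in product.items() if k != "review"}
    return [
        {
            "text": review.get("text"),
            "review_date": review.get("review_date"),
            "product": product_fields,
        }
        for review in product.get("review", [])
    ]


# Hypothetical sample document, not taken from the real data set.
sample = {
    "product_number": "P-1",
    "product_name": "Example product",
    "review": [
        {"text": "good value", "review_date": "2020-10-01"},
        {"text": "stopped working", "review_date": "2020-10-15"},
    ],
}
docs = flatten_product(sample)
```

Each resulting document repeats the product fields, which is the redundancy trade-off of this design: more storage in exchange for plain, non-nested queries on the review text.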
Jaycreation
  • Thanks for the reply, but I need the total number of matched documents to show on my site; that's why I enabled track_total_hits. The mapping you suggested doesn't work for me: I need all reviews mapped to a product, not a product mapped to each review. In your case, if I search for a product I get multiple results, if I'm not wrong. – Code Wizard Oct 29 '20 at 09:23
  • You could do it with two requests: get the matched product IDs from the reviews index, then run a search with an ids query on the first index. It should be faster. You could also use an aggregation on product names, for example. If you have several products with 50k+ reviews each and want to get "all reviews mapped for a product", your response must be very heavy, no? By the way, the _source in your example query does not exclude anything. – Jaycreation Oct 29 '20 at 10:13
  • I have updated my query to exclude review from the result. As I mentioned, I have 107M documents (products). In your case, suppose I query for a term that matches 10k reviews of one product, and I want all the products that have any review matching the term: I would get 10k results for that one product. If the term matched 1M products, how many results would I get, and how could I use all the IDs from the first query in the second? I hope you understand my point. – Code Wizard Oct 29 '20 at 10:32
  • If you add size 0 and a terms agg + top_hits you will get just what you need. But if you want all results for a query that returns 1M products, you will have to paginate (use the scroll API or something similar). I can think of very few use cases where you need all of the results. For a UI, the first 10 or first 100 will be enough; for a batch job, it's better to use a scroll. 107M documents is not a lot for Elasticsearch. (In your case it could be better to have 2 or 3 smaller nodes than one big node.) – Jaycreation Oct 29 '20 at 10:46
  • That might work, but I see redundant product data. If I separate the review index and the product index, I can't search on product details together with reviews; to do that, each review would also need the product details, and I don't think that is a good idea. I think we are heading in a different direction. – Code Wizard Oct 29 '20 at 11:53
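
The two-request approach discussed in the comments can be sketched as below. It only builds request bodies and sends nothing to a cluster. It rests on two assumptions that are not in the original posts: product.product_number is mapped as keyword in the reviews index (a terms aggregation needs doc values, which the text mapping shown in the answer does not provide), and each product document is indexed with its product_number as _id. The function names and the mock response are hypothetical.

```python
def matching_products_body(term, page_size=100):
    """Request 1, on the reviews index: one bucket per product with a matching review."""
    return {
        "size": 0,  # no individual review hits, only aggregation buckets
        "query": {"match": {"text": term}},
        "aggs": {
            "products": {
                # Assumption: product.product_number is a keyword field.
                "terms": {"field": "product.product_number", "size": page_size},
                "aggs": {
                    "sample_review": {
                        "top_hits": {"size": 1, "_source": ["text", "review_date"]}
                    }
                },
            }
        },
    }


def products_by_ids_body(agg_response):
    """Request 2, on the product indices: fetch the products found by request 1."""
    buckets = agg_response["aggregations"]["products"]["buckets"]
    # Assumption: each product document uses its product_number as _id.
    ids = [bucket["key"] for bucket in buckets]
    return {"size": len(ids), "query": {"ids": {"values": ids}}}


# Mock of the relevant part of an aggregation response, for illustration only.
mock_response = {
    "aggregations": {
        "products": {
            "buckets": [
                {"key": "P-1", "doc_count": 3},
                {"key": "P-2", "doc_count": 1},
            ]
        }
    }
}
second_request = products_by_ids_body(mock_response)
```

With this shape, a term that matches 10k reviews of one product yields a single bucket for that product, with doc_count carrying the number of matching reviews.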