
I am currently working with Elasticsearch on an index containing a huge number of documents (around 500K). I want to store the n-grams of each document's text data (which is also large: each document contains about two pages of text) in another index. To do this, I am calculating the term vectors and their counts in each document and storing them in a new index, so that I can run aggregation queries against that new index.

The settings of the old index allow me to use the termvectors and mtermvectors APIs. Since I don't want to hit the Elasticsearch server with too many requests in a short amount of time, I am using the mtermvectors Python API and requesting the term vectors of 25 documents at a time by passing their 25 IDs.

Sample HTTP URL generated by the mtermvectors API call in Python:

http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
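
For reference, the equivalent call through the elasticsearch-py client, together with the step of storing the per-document term counts in the new index, looks roughly like this (the host, the index names, including the hypothetical target index plain_text_ngrams, and the sample IDs are placeholders; the parsing follows the standard term vectors response shape, docs[*].term_vectors.plain_text.terms):

from elasticsearch import Elasticsearch, helpers

# Host and index names are placeholders for my setup.
es = Elasticsearch("http://servername/elastic")

doc_ids = ["608467", "608469", "608473"]  # one batch of up to 25 IDs

response = es.mtermvectors(
    index="indexname",
    doc_type="article",        # doc types still apply on ES <= 6.x
    ids=doc_ids,
    fields=["plain_text"],
    field_statistics=True,
    term_statistics=True,
    offsets=False,
    positions=False,
    payloads=False,
)

# Flatten each document's terms into (article_id, term, count) rows
# and bulk-index them into the hypothetical n-gram index.
actions = (
    {
        "_index": "plain_text_ngrams",
        "_type": "article",    # needed on ES <= 6.x
        "_source": {
            "article_id": doc["_id"],
            "term": term,
            "count": stats["term_freq"],
        },
    }
    for doc in response["docs"]
    for term, stats in doc["term_vectors"]["plain_text"]["terms"].items()
)
helpers.bulk(es, actions)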

Sometimes I get the expected response, but most of the time I get the following error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.

Reason: Error reading from remote server

Index settings and mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token":""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {"article_id":{"type": "text"},
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
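
As a sanity check that shingleAnalyzer produces the expected 2-4 grams, the _analyze API can be run against the index (host and index name are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://servername/elastic")  # placeholder host

# Run a sample sentence through the custom analyzer defined above.
result = es.indices.analyze(
    index="indexname",
    body={"analyzer": "shingleAnalyzer", "text": "The quick brown fox jumps"},
)
print([t["token"] for t in result["tokens"]])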

I don't think there is any problem with these settings and this mapping, since I do sometimes get the expected response.

Please let me know if you need more information from my side. Any help will be appreciated.

  • Doesn't look like an ES error to me. Would need more context behind those proxy errors. – Joe - GMapsBook.com Apr 19 '20 at 22:28
  • I checked the server logs, since I was getting the proxy error from the server, and found that there was a timeout issue. I increased that timeout to 10 minutes and tried the same HTTP request multiple times, but I am still getting the same proxy error. What could be the possible reason behind this? – Ketan Krishna Patil Apr 21 '20 at 00:53
  • So no requests got through the proxies? – Joe - GMapsBook.com Apr 21 '20 at 08:21
  • @jzzfs Some requests are going through the proxy; I do get the expected response sometimes. If I run the same HTTP request repeatedly, it works some of the time and gives the timeout error the rest of the time. I checked the server logs as well, and they do not show any other errors. – Ketan Krishna Patil Apr 22 '20 at 03:39
  • @KetanKrishnaPatil As you have mentioned, you have a lot of text data, which means your index size might be very large. Your Elasticsearch data node is most likely spending time swapping the index in and out. Try to reduce the index size by disabling indexing on unwanted fields. Also try passing a higher timeout value in your API call (see the sketch below). https://stackoverflow.com/questions/28287261/connection-timeout-with-elasticsearch – Shubham Najardhane Apr 24 '20 at 10:06
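
Following the suggestion in the last comment, a higher timeout can be passed per request and combined with simple retries between batches. A minimal sketch (host, index name, and retry/backoff values are placeholders, not a tested fix):

import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionTimeout

es = Elasticsearch("http://servername/elastic", timeout=120)  # placeholder host

def mtermvectors_with_retry(ids, retries=3, backoff=5):
    """Fetch term vectors for one batch of IDs, retrying on timeouts."""
    for attempt in range(retries):
        try:
            return es.mtermvectors(
                index="indexname",
                doc_type="article",
                ids=ids,
                fields=["plain_text"],
                term_statistics=True,
                field_statistics=True,
                offsets=False,
                positions=False,
                payloads=False,
                request_timeout=600,  # per-request override, e.g. 10 minutes
            )
        except ConnectionTimeout:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # back off before retrying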

0 Answers