
We are building a search tool on Elasticsearch, and our query has to match the nearest indexed value to each user input value. For example, if the user inputs [1,10,100,1000,10000], the query should return, for each element of the array, the closest value available in our index.
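To make the expected behaviour concrete, here is a minimal pure-Python sketch of the matching we are after. It does not touch Elasticsearch at all; the helper name `nearest_values` and the sample values are made up for illustration only.

```python
from bisect import bisect_left


def nearest_values(indexed, targets):
    """For each target, return the closest value found in `indexed`."""
    ordered = sorted(indexed)
    if not ordered:
        raise ValueError("indexed must not be empty")
    result = []
    for t in targets:
        pos = bisect_left(ordered, t)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = ordered[max(pos - 1, 0):pos + 1]
        result.append(min(candidates, key=lambda v: abs(v - t)))
    return result


# Hypothetical START values standing in for what is stored in the index.
sample_index = [3, 12, 95, 1100, 9000]
print(nearest_values(sample_index, [1, 10, 100, 1000, 10000]))
# -> [3, 12, 95, 1100, 9000]
```

On a tie, this sketch returns the lower of the two candidates; which one the real query should prefer is not something we have decided.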

Right now we use the following query to retrieve one value at a time, looping over the user's input array and sending one request per element, and it is really slow.

{
    "query": {
        "term": {"CHR": "chr1"}
    },
    "sort": {
        "_script": {
            "type": "number",
            "script": {
                "lang": "painless",
                "params": {
                    "factor": 10000
                },
                "inline": "def cur = 0; cur = (params.factor - doc['START'].value); if (cur < 0) { cur = cur * -1 } else { cur = cur}"
            },
            "order": "asc"
        }
    }
}

Our requirement is that factor takes an array of integers rather than a single value, and that the query returns, for each element, the first closest value it finds in our index.

The complete Python function is posted below:

import requests


def gene_peek(coordinate, chr):
    # For each (coordinate, chromosome) pair, fetch the document whose START
    # is closest to the coordinate. One HTTP request is sent per pair.
    peek_liver = []
    for i in range(0, len(coordinate)):
        a = int(coordinate[i])
        res = requests.post("http://localhost:9200/lab/peek_liver/_search?pretty=true&scroll=10m&size=1", json={
            "query": {
                "term": {"CHR": chr[i]}
            },
            "sort": {
                "_script": {
                    "type": "number",
                    "script": {
                        "lang": "painless",
                        "params": {
                            "factor": a
                        },
                        "inline": "def cur = 0; cur = (params.factor - doc['START'].value); if (cur < 0) { cur = cur * -1 } else { cur = cur}"
                    },
                    "order": "asc"
                }
            }
        })
        data = res.json()
        # size=1 with ascending sort: the first hit is the nearest match.
        peek_liver.append(data["hits"]["hits"][0]["_source"])
    return peek_liver

Any help would be greatly appreciated. Thanks.

  • What do you mean by "its really slow"? How long does it take? Which part of the flow takes how much of the time? – Yoav Gur Mar 09 '18 at 17:39
  • In the above query, the two variables are the values for "CHR" and "factor". Ideally I want to pass an array for CHR and factor and get the job done in a single query instead of iterating in Python, because the array can hold up to 300 values, which means 300 POST calls to the Elasticsearch server. That is really time consuming (around 6-10 seconds). Hope this helps. – Mr Bad Guy Mar 09 '18 at 17:45
  • I am not sure the query syntax supports such a query (maybe it does), but in case it doesn't, you might save quite some time if you stack a batch of requests, and then send it as one batch, rather than one at a time. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html – Yoav Gur Mar 09 '18 at 18:16
  • Looks like that would improve performance to some extent by reducing server-client communication. I will evaluate it and post an update here. Thanks. – Mr Bad Guy Mar 09 '18 at 18:36
  • Update: There was no performance gain. The query took exactly the same amount of time to display the output. – Mr Bad Guy Mar 11 '18 at 03:19
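
For reference, the batching suggested in the comments could look roughly like the sketch below. It builds one newline-delimited body for the multi-search endpoint from the same per-element query used in the question. The helper name `build_msearch_body` is hypothetical, and the script string is copied unchanged from the question. Note that this only removes HTTP round trips: the script sort still runs against every matching document for each sub-search, which may be why no gain was observed.

```python
import json

# Script copied verbatim from the question.
SCRIPT = ("def cur = 0; cur = (params.factor - doc['START'].value); "
          "if (cur < 0) { cur = cur * -1 } else { cur = cur}")


def build_msearch_body(coordinates, chromosomes):
    """Return an NDJSON body with one header/body pair per input element."""
    lines = []
    for coord, chrom in zip(coordinates, chromosomes):
        # Empty header: the index and type are taken from the request URL.
        lines.append(json.dumps({}))
        lines.append(json.dumps({
            "size": 1,
            "query": {"term": {"CHR": chrom}},
            "sort": {
                "_script": {
                    "type": "number",
                    "script": {
                        "lang": "painless",
                        "params": {"factor": int(coord)},
                        "inline": SCRIPT,
                    },
                    "order": "asc",
                }
            },
        }))
    # The multi-search body must end with a newline.
    return "\n".join(lines) + "\n"


# Possible usage (not executed here; assumes the `requests` package and the
# local index from the question):
#   body = build_msearch_body([1, 10, 100], ["chr1", "chr1", "chr2"])
#   res = requests.post("http://localhost:9200/lab/peek_liver/_msearch",
#                       data=body,
#                       headers={"Content-Type": "application/x-ndjson"})
#   hits = [r["hits"]["hits"][0]["_source"] for r in res.json()["responses"]]
```

The results come back in the `responses` array in the same order as the sub-searches, so they line up with the input array by position.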
