In a search engine built with Elasticsearch, what is the best way to take user clicks on result items into account, so that documents with more user impressions get better scores?

Are there any tools or plugins ready to use, or should I write it from the ground up?

The solution is expected to consider the following, as Google does:

  • number of times each document has been shown
  • number of times a user has clicked on a document
  • the query which user searched (a document may be important in a specific query but unimportant in others)
  • ...
Ria
javad helali
  • How do you want ES to know about clicks? That's impossible; this is something you need to implement in your system, which will later submit some click data to ES as boost/bury values. – Mysterion Jan 18 '17 at 08:11
  • I know my system should submit click data to Elasticsearch. My question is how I can make Elasticsearch consider this log data, which contains queries and the results clicked, as part of the scoring formula. It's not as simple as boosting a document. – javad helali Jan 18 '17 at 08:23
  • Yes, it's not simple, but something easy and ad hoc could be implemented, like query-time boosting based on some formula, let's say score = initial_score + clicksw * shows or something. – Mysterion Jan 18 '17 at 08:38

1 Answer

If you are developing your API in Ruby on Rails, you can look at Searchkick, which pretty much does the job by making the search smarter over time as usage grows.

If you are not on Rails, or you want to develop your own in-house implementation, here are a few suggestions on architecture from my side.

Let's start with a basic overview: the key modules, the downsides, and how to adapt the architecture to those downsides.

You will need:

1) A scoring algorithm, where you define the formula that generates the score for each document. Let's consider the parameters you mentioned:

a) number of times each document has been shown
b) number of times the document has been clicked
c) the query with which the document was searched

You have not mentioned how a) and b) fit into the current context. I will assume a simple setup, but if you want to build a really smart solution I would also combine a) and b) with c), for example by counting how many times the document has appeared for a given keyword. When I search for "snow boots", the ranking should consider these numbers (count of appearances / number of clicks) only for queries that were more or less like "snow boots", not for all queries. "snow boots" can be broken down into keywords, each with the following metadata (keyword order and proximity are only approximated).

{
    "keyword": "snow",
    "document_ids": [3, 5, 6, 8],
    "document_ids_views": [{
        "doc_id": 3,
        "views": 110,
        "clicks": 560
    }, {
        "doc_id": 5,
        "views": 100,
        "clicks": 78
    }, {
        "doc_id": 6,
        "views": 100,
        "clicks": 120
    }, {
        "doc_id": 8,
        "views": 100,
        "clicks": 465
    }]
}

{
    "keyword": "boots",
    "document_ids": [3, 5, 6, 8],
    "document_ids_views": [{
        "doc_id": 3,
        "views": 100,
        "clicks": 56
    }, {
        "doc_id": 5,
        "views": 100,
        "clicks": 78
    }, {
        "doc_id": 6,
        "views": 100,
        "clicks": 120
    }, {
        "doc_id": 8,
        "views": 100,
        "clicks": 465
    }]
}

Above is the aggregated data for each keyword, stored in a separate database.

In this way I would build up the search statistics on a daily basis in a separate datastore, let's say MongoDB. If "snow" is already in my metadata and new queries come in with this keyword, I would update the same metadata document.
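As a rough illustration of that aggregation step, here is a minimal Python sketch. The event shape (`query`, `shown`, `clicked`) and both function names are assumptions made for this example; the output follows the per-keyword layout of the JSON above, and an in-memory dict stands in for the separate datastore.

```python
from collections import defaultdict


def aggregate_events(events):
    """Fold raw search events into per-keyword, per-document counters.

    Each event is a dict with a hypothetical shape:
        {"query": "snow boots", "shown": [3, 5], "clicked": [3]}
    """
    stats = defaultdict(lambda: defaultdict(lambda: {"views": 0, "clicks": 0}))
    for event in events:
        # Naive keyword split; a real system would reuse the index analyzer.
        keywords = set(event["query"].lower().split())
        for keyword in keywords:
            for doc_id in event["shown"]:
                stats[keyword][doc_id]["views"] += 1
            for doc_id in event["clicked"]:
                stats[keyword][doc_id]["clicks"] += 1
    return stats


def to_meta_documents(stats):
    """Render the counters as one metadata document per keyword."""
    docs = []
    for keyword, per_doc in sorted(stats.items()):
        docs.append({
            "keyword": keyword,
            "document_ids": sorted(per_doc),
            "document_ids_views": [
                {"doc_id": doc_id, **per_doc[doc_id]}
                for doc_id in sorted(per_doc)
            ],
        })
    return docs
```

Feeding a day of logged searches through `aggregate_events` and then `to_meta_documents` gives one metadata document per keyword, which can be upserted into the stats store.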

Now I want to discuss the downside, and why I choose to keep these statistics in a separate database instead of appending them to the Elasticsearch documents.

I don't want to hammer my Elasticsearch cluster with updates to the click and view counts every time a new query is fired, because updates are very I/O intensive: they trigger reindexing and segment merging in the inverted index.

To remedy this downside, I would have a daily or bi-daily batch job that ports this metadata to each document in Elasticsearch. I would rebuild the whole index with the new metadata and move the alias from the old index to the new index, without any downtime.

To attach this information to the Elasticsearch documents, I would use a parent-child relationship that maps each document to the keywords associated with it.

So my basic parent document and child documents can look like this:
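The alias switch at the end of such a batch job can be expressed as a single request to the `_aliases` endpoint. Below is a minimal Python sketch that only builds the request body; the alias and index names are hypothetical, and sending the body (with an HTTP client or the official client library) is left out.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias`.

    Both actions travel in one request, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: yesterday's and today's rebuilt indices.
    body = alias_swap_actions("products", "products_v1", "products_v2")
    print(json.dumps(body, indent=2))
```

The application always searches the alias, so the rebuilt index goes live as soon as this request is applied.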

parent document

PUT /index/type/3
{
  "name":  "Reebok shoes",
  "category":   "snow boots",
  "price": 120
}

child document

PUT /index/type_meta/1?parent=3
{
    "keyword": "boots",
    "document_id": 3,
    "doc_id": 3,
    "views": 100,
    "clicks": 56
}

PUT /index/type_meta/2?parent=3
{
    "keyword": "snow",
    "document_id": 3,
    "doc_id": 3,
    "views": 110,
    "clicks": 560
}

The parent-child documents above pretty much explain how I build the search statistics metadata for each document.

So far we have built a solution that gathers event data for search statistics and relates it to each document in Elasticsearch.

Let's look at the scoring query next.

I will not go deep into designing the scoring algorithm here; I will focus on implementing a query that scores documents based on the views and clicks associated with a keyword, as well as on relevance to the keywords.
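The batch job has to turn the per-keyword statistics into one child document per (keyword, document) pair. A minimal Python sketch of that flattening follows; the function name, the sequential child ids, and the default index and type names are assumptions for illustration, and the paths simply mirror the `PUT /index/type_meta/<id>?parent=<doc_id>` form used above.

```python
def child_requests(meta_documents, index="index", child_type="type_meta"):
    """Flatten per-keyword stats into (path, body) pairs, one per child doc.

    `meta_documents` uses the per-keyword layout shown earlier:
    {"keyword": ..., "document_ids_views": [{"doc_id", "views", "clicks"}]}
    """
    requests = []
    next_id = 1  # each child needs its own id, or it overwrites another
    for meta in meta_documents:
        for entry in meta["document_ids_views"]:
            path = f"/{index}/{child_type}/{next_id}?parent={entry['doc_id']}"
            body = {
                "keyword": meta["keyword"],
                "doc_id": entry["doc_id"],
                "views": entry["views"],
                "clicks": entry["clicks"],
            }
            requests.append((path, body))
            next_id += 1
    return requests
```

In practice these pairs would probably be sent through the bulk API rather than as individual PUT requests.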

Function score query

Script score

I may choose to give more weight to matches in name than to matches in category. That depends entirely on your use case, so I will not design the scoring formula for you.

{
    "query": {
        "function_score": {
            "query": {
                "match_all": {}
            },
            "boost": "5",
            "functions": [{
                "filter": {
                    "match": {
                        "name": "snow"
                    }
                },
                "random_score": {},
                "weight": 200
            }, {
                "filter": {
                    "match": {
                        "name": "boots"
                    }
                },
                "weight": 200
            }, {
                "filter": {
                    "match": {
                        "category": "snow"
                    }
                },
                "random_score": {},
                "weight": 100
            }, {
                "filter": {
                    "match": {
                        "category": "boots"
                    }
                },
                "weight": 100
            }, {
                "filter": {
                    "has_child": {
                        "type": "type_meta",
                        "query": {
                            "match": {
                                "keyword": "snow"
                            }
                        }
                    }
                },
                "script_score": {
                    "script": {
                        "lang": "painless",
                        "inline": "_score + 20 * doc['clicks'].value + 40 * doc['views'].value"
                    }
                }
            }, {
                "filter": {
                    "has_child": {
                        "type": "type_meta",
                        "query": {
                            "match": {
                                "keyword": "boots"
                            }
                        }
                    }
                },
                "script_score": {
                    "script": {
                        "lang": "painless",
                        "inline": "_score + 20 * doc['clicks'].value + 40 * doc['views'].value"
                    }
                }
            }],

            "score_mode": "max",
            "boost_mode": "multiply"
        }
    }
}

So you can use a query similar to the one above. I have chosen a very simple formula with demo boost parameters for each clause, and the query can be refactored further to implement a more advanced scoring algorithm.

The script score function is important here: I first filter on the child documents that match the search keywords for a given parent document, and then use script score so that the click and view counts affect the overall document score. Keep in mind that doc['clicks'] and doc['views'] are read from the document being scored, so the counters must be available on that document.

This is the kind of solution I was looking to implement in my own project, and I am open to suggestions and improvements.

Please do share your suggestions and improvements.

Hope this helps. Thanks.
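Since the weight clauses repeat for every keyword and field, they can be generated instead of written by hand. The Python sketch below is one possible refactoring under assumed names (`build_function_score`, `field_weights`, `extra_functions`); it only builds the request body as a dict and leaves the click-based functions to be passed in by the caller.

```python
def build_function_score(keywords, field_weights, extra_functions=()):
    """Build a function_score body with one weight clause per field/keyword.

    field_weights maps a field name to the weight applied when that field
    matches a keyword, e.g. {"name": 200, "category": 100}.
    extra_functions lets the caller append further clauses, such as the
    script_score functions for clicks and views.
    """
    functions = [
        {"filter": {"match": {field: keyword}}, "weight": weight}
        for field, weight in field_weights.items()
        for keyword in keywords
    ]
    functions.extend(extra_functions)
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": functions,
                "score_mode": "max",
                "boost_mode": "multiply",
            }
        }
    }
```

For example, `build_function_score("snow boots".split(), {"name": 200, "category": 100})` yields the four name and category clauses of the query above, without the `random_score` entries.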
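To get a feel for the numbers this produces, here is a small offline Python sketch of how the per-function scores are combined. It is a simplified model, not Elasticsearch code: it covers only a few `score_mode` and `boost_mode` values, ignores the top-level `boost`, and assumes a function score of 1 when no function matches.

```python
def click_script_score(query_score, clicks, views):
    """Mirror of the Painless expression used in the query above."""
    return query_score + 20 * clicks + 40 * views


def combine(query_score, function_scores, score_mode="max", boost_mode="multiply"):
    """Simplified model of how function_score merges its functions."""
    if not function_scores:
        function_score = 1.0  # assumption: neutral value when nothing matches
    elif score_mode == "max":
        function_score = max(function_scores)
    elif score_mode == "sum":
        function_score = sum(function_scores)
    else:
        raise ValueError(f"score_mode not modelled: {score_mode}")

    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "replace":
        return function_score
    raise ValueError(f"boost_mode not modelled: {boost_mode}")


if __name__ == "__main__":
    # A document matching "boots" in name (weight 200) with 56 clicks, 100 views.
    script = click_script_score(1.0, clicks=56, views=100)
    print(combine(1.0, [200, script]))
```

With the demo coefficients, the click and view term is far larger than the 200 and 100 field weights, so under `score_mode: max` it decides the ranking on its own; the coefficients would need tuning before the name and category matches carry any influence.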

user3775217