
I am looking for a way to tune Elasticsearch's relevance scoring for document names like "bottle" and "bottle caps".

When someone searches for "bottle", "bottle caps" should be scored lower than "Red bottles".

Currently our search engine scores "red coloured bottle" as less relevant than "Bottle caps for 500ml bottle".

vishnu

1 Answer


This is not something you can solve in Elasticsearch without adding more information. You want to rank "red bottles" over "bottle caps" because you have semantic knowledge about these names: you know that "red bottles" refers to a bottle, while "bottle caps" refers to something else (related to bottles, but not actually a bottle). If you want Elasticsearch's ranking to take this into account, you have to index that information, for example by adding a keyword tag field whose value is "bottle" for one document and "bottle caps" for the other. You will have to experiment to see what works for your use case. Of course, this means that a person has to add tags for everything.
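As a rough sketch of the tagging idea, the requests below create a separate index with a keyword field and use it to boost exact tag matches. The index name `tagged_products`, the field name `product_type`, and the boost value are all hypothetical choices for illustration, and the `doc` mapping type assumes an Elasticsearch version that still uses mapping types.

```json
# Hypothetical index: "name" is analyzed text, "product_type" is an exact-match tag
PUT /tagged_products
{
  "mappings": {
    "doc": {
      "properties": {
        "name": { "type": "text" },
        "product_type": { "type": "keyword" }
      }
    }
  }
}

# Tags are assigned by hand here
PUT /tagged_products/doc/1
{"name": "Red coloured bottles", "product_type": "bottle"}

PUT /tagged_products/doc/2
{"name": "Bottle caps for 500ml bottle", "product_type": "bottle caps"}

# Both documents match on "name"; only documents tagged "bottle" get the extra boost
POST /tagged_products/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "bottle" } },
      "should": { "term": { "product_type": { "value": "bottle", "boost": 2.0 } } }
    }
  }
}
```

Because the `should` clause is optional, untagged or differently tagged documents still match; they just rank below the ones whose tag equals the search term. The boost would need tuning against real data.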

However, I suspect you can improve the situation somewhat with the unique token filter. My guess is that you don't care much about term frequency within a single title: "Bottle caps for 500ml bottle" isn't more about bottles just because "bottle" appears in it twice, and term frequency makes little sense for short titles like this. So you could do something like this:

PUT /myindex
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "analysis": {
      "analyzer": {
        "uniq_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "porter_stem",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "uniq_analyzer"
        }
      }
    }
  }
}

PUT /myindex/doc/1
{"name": "Red coloured bottles"}

PUT /myindex/doc/2
{"name": "Bottle caps for 500ml bottle"}

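Then if you search for "bottle", the two scores should be identical or very nearly so, since the repeated "bottle" no longer inflates the term frequency; any remaining gap comes from field-length normalization. That is not perfect, but it is an improvement. To understand where a score comes from, you can use explain: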

POST /myindex/_search
{
  "explain": true,
  "query": {
    "match": {
      "name": "bottle"
    }
  }
}
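
To check that the analyzer behaves as intended, the `_analyze` API can be run against the same index. This is only a sketch and assumes the `myindex` settings shown above; if the unique filter is working, the stemmed form of "bottle" should appear only once in the returned token list.

```json
# Show the tokens uniq_analyzer emits for the second title
POST /myindex/_analyze
{
  "analyzer": "uniq_analyzer",
  "text": "Bottle caps for 500ml bottle"
}
```
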
dshockley
  • Thank you dshockley. I wanted to double-check whether manual product tagging could be avoided. I wonder whether documents get tagged manually in big marketplaces like eBay, where higher term frequency doesn't necessarily mean higher relevancy. Do they leave it to the sellers to categorize their products correctly? – vishnu Sep 06 '17 at 08:43
  • You could certainly try to tag automatically, but you would probably want to start with some labeled training data. You could also try an NLP approach: do POS tagging, then add as tags anything classified NNS (plural noun). It would help in your example, but I'm not sure whether it would hurt somewhere else. I don't know whether eBay, Amazon Marketplace, etc. do any automatic tagging or just rely on the seller. If I had to design it, I would probably rely on the seller, but use an automated technique to flag items that might be mis-categorized for review. – dshockley Sep 06 '17 at 09:29