
I am trying to build a tag cloud of words and phrases using the facets feature of Elasticsearch.

My mapping:

curl -XPOST http://localhost:9200/myIndex/ -d '{

  ...

  "analysis":{  
    "filter":{ 
      "myCustomShingle":{
        "type":"shingle",
        "max_shingle_size":3,
        "output_unigrams":true
      }
    },
    "analyzer":{ //making a custom analyzer
      "myAnalyzer":{
        "type":"custom",
        "tokenizer":"standard",
        "filter":[
          "lowercase",
          "myCustomShingle",
          "stop"
        ]
      } 
    }
  }

  ...
},
"mappings":{

   ...


   "description":{ //the field to be analyzed for making the tag cloud
     "type":"string",
     "analyzer":"myAnalyzer",
     "null_value" : "null"
   },


   ...



}

Query for generating facets:

curl -X POST "http://localhost:9200/myIndex/myType/_search?pretty=true" -d '
{
  "size": 0,

  "query": {
    "match_all": {}
  },

  "facets": {
    "blah": {
      "terms": {
        "fields" :     ["description"],
        "exclude" : [ "evil" ], //remove facets that contain these words
        "size": 50
      }
    }
  }
}'

My problem: when I put a word, say 'evil', in the "exclude" option of the facet, it successfully removes the facet entries that match 'evil' exactly (the single-word shingles). But it doesn't remove the two- and three-word shingles that contain it, such as "resident evil", "evil computer", or "my evil cat". How do I remove the facet entries for phrases that contain the excluded words?

serpent403

1 Answer


It isn't completely clear what you want to achieve. You usually wouldn't build facets on analyzed fields. Maybe you could explain why you're producing shingles, so that we can help you achieve what you want in a better way.

With the exclude facet parameter you can exclude specific entries, but "evil" is not the same as "resident evil". If you want to exclude the latter, you need to list it explicitly. Facets are built from indexed terms, and "resident evil" is in fact a single term in the index, which is not the same as the term "evil".
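
To make the point about indexed terms concrete, here is a small Python simulation. It is only a sketch: `shingles` and `exclude_exact` are hypothetical helpers written for illustration, not Elasticsearch code, and they only approximate what a shingle filter with `max_shingle_size` 3 and `output_unigrams` true emits, and how an exact-match exclude list treats those terms.

```python
def shingles(tokens, max_size=3):
    """Return unigrams plus word n-grams up to max_size, joined by spaces.

    Hypothetical helper: a rough stand-in for a shingle token filter.
    """
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def exclude_exact(terms, excluded):
    """Drop only the terms that are exactly equal to an excluded entry."""
    return [t for t in terms if t not in excluded]


terms = shingles("my evil cat".split())
kept = exclude_exact(terms, {"evil"})

print(terms)  # every term the simulated analyzer would index
print(kept)   # what survives an exact-match exclude of "evil"
```

In this simulation only the unigram "evil" is dropped. "my evil", "evil cat" and "my evil cat" are separate terms, so an exact-match exclude leaves them in place.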

Given the choices you have already made for indexing and faceting, there is a way to achieve what you want. Elasticsearch has a really powerful scripting module. You can use a script to decide whether each entry should be included in the facet: the script is evaluated for each term, and returning false drops that term. For example, to drop every term that contains 'evil':

{
  "query": {
    "match_all" : {}
  },
  "facets": {
    "tags": {
      "terms": {
        "field" : "tags",
        "script" : "term.contains('evil') ? false : true"
      }
    }
  }
}
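
If several words need to be excluded, the script string can be generated rather than written by hand. The following Python sketch does that. `build_exclude_script` and `build_facet_request` are hypothetical helper names, the facet name "tags" is taken from the example above, and the sketch assumes the words contain no single quotes. Note that `contains` is a substring test, so a short word such as 'i' would also drop every term that merely contains the letter i.

```python
import json


def build_exclude_script(words):
    """Build a script that returns false when the term contains any word.

    Hypothetical helper; assumes the words contain no single quotes.
    """
    if not words:
        raise ValueError("at least one word is required")
    checks = " || ".join("term.contains('%s')" % w for w in words)
    return "(%s) ? false : true" % checks


def build_facet_request(field, words, size=50):
    """Assemble a terms facet request body that uses the generated script."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "facets": {
            "tags": {
                "terms": {
                    "field": field,
                    "size": size,
                    "script": build_exclude_script(words),
                }
            }
        },
    }


body = build_facet_request("description", ["evil", "resident"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the `_search` request. With a very long word list the script grows accordingly, and it is evaluated for every term, so this is a sketch of the approach rather than a recommendation for large stop lists.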
javanna
  • What should I put in "script" if I have multiple exclude words, e.g. ['evil','i','a','the']? – serpent403 Oct 08 '12 at 11:44
  • Have a look at the [MVEL operators](http://mvel.codehaus.org/Operators). I guess you could combine them with OR, for instance `term.contains('evil') || term.contains('i')`, etc. – javanna Oct 08 '12 at 11:51
  • I actually have a huge set of such stop keywords. Is this the right way to do it? Is there an alternative? – serpent403 Oct 08 '12 at 13:27
  • I personally wouldn't do anything like this on a production system, but if you want more help you should update your question with more detail. What do you want to achieve? Why are you using shingles? – javanna Oct 08 '12 at 14:14
  • Same as this: http://elasticsearch-users.115913.n3.nabble.com/ShingleFilter-with-Stop-List-for-Tag-Cloud-td3923563.html – serpent403 Oct 09 '12 at 09:30
  • A facet is the right choice for building a tag cloud. What is not clear is why you need to remove some entries. Can you explain that? – javanna Oct 10 '12 at 18:26