ElasticSearch: Count Frequency of Occurrence of a Set of Words in a Set of Documents

Question

I have the following ElasticSearch query:

{
  "from": 0,
  "sort": [
    "_score"
  ],
  "fields": [
    "id",
    "title",
    "text"
  ],
  "query": {
    "query_string": {
      "fields": [
        "title",
        "text"
      ],
      "query": "(\"green socks\" OR \"red socks\") AND NOT (\"yellow\" OR \"blue\")"
    }
  },
  "size": 100
}

This works fine and returns a set of around 80,000 documents.

I would like to calculate the following upon this set of 80,000 documents (i.e. the set of documents that matches "query": "(\"green socks\" OR \"red socks\") AND NOT (\"yellow\" OR \"blue\")"):

For "green socks", calculate the number of documents within the 80,000 that contain "green socks" at least once.
For "red socks", calculate the number of documents within the 80,000 that contain "red socks" at least once.
And so on, for all the other words/phrases on the "left-hand" side of the above query string.
There are actually about 50 - 100 such words/phrases in each query string I'm running, not just the two shown here.

This feels like an aggregation query, but I just can't see it.
Any help very gratefully received,

Thanks,
R

score 2 · Accepted Answer · answered Apr 24 '15 at 10:31

You have guessed right: this is a job for aggregations. But aggregations can be slow if your mapping is not right. For example, aggregating on an analyzed field like "text", which may contain lots of tokens, leads to high memory usage and in turn hampers performance.

Now, coming to your requirement: you want the count of documents containing, say, "red socks" within the set of 80,000 results. Do you want the term to be present anywhere (that is, in either the title or the text field) or only in a particular field? If you want it to match in any field, you first need to combine the fields into a single field.

You can use a simple terms aggregation along with your query, which returns a document count for each of the most frequent terms in the field (the top 10 by default).

{
  .................
  "query": {
    "query_string": {
      "fields": [
        "title",
        "text"
      ],
      "query": "(\"green socks\" OR \"red socks\") AND NOT (\"yellow\" OR \"blue\")"
    }
  },
  "aggs": {
    "my-terms": {
      "terms": {
        "field": "title"
      }
    }
  },
  "size": 100
}
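
For reference, a terms aggregation result comes back under aggregations, then the aggregation name, then buckets, as a list of key/doc_count pairs. Here is a minimal Python sketch of reading it, assuming the search response has already been parsed into a dict. The helper name term_counts is hypothetical and sample_response is invented purely to show the shape:

```python
def term_counts(response, agg_name="my-terms"):
    """Map each bucket key of a terms aggregation to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Invented response fragment, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "my-terms": {
            "buckets": [
                {"key": "socks", "doc_count": 3},
                {"key": "green", "doc_count": 2},
            ]
        }
    }
}

print(term_counts(sample_response))
```

Because an analyzed field is bucketed token by token, the keys are single tokens such as "socks", so a phrase like "red socks" would not show up as a bucket of its own.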

If you want counts only for a certain set of terms, such as "red socks", "green socks" etc., then you should use a filters aggregation:

{
  .................
  "query": {
    "query_string": {
      "fields": [
        "title",
        "text"
      ],
      "query": "(\"green socks\" OR \"red socks\") AND NOT (\"yellow\" OR \"blue\")"
    }
  },
  "aggs": {
    "my-terms": {
      "filters": {
        "filters": {
          "red socks":   { "term": { "title": "red socks"   }},
          "green socks": { "term": { "title": "green socks" }},
          ......
          and so on...
        }
      }
    }
  },
  "size": 100
}
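
With 50 - 100 phrases, writing the named filters by hand gets tedious, so the request body can be generated. The following is a minimal Python sketch, not a definitive implementation: build_filters_agg and phrase_counts are hypothetical helper names, the field defaults to "title" as in the example above, and the code only builds and reads plain dicts (sending the request is left to whatever client you use).

```python
import json


def build_filters_agg(phrases, field="title", agg_name="my-terms"):
    """Build an "aggs" section with one named term filter per phrase."""
    named_filters = {phrase: {"term": {field: phrase}} for phrase in phrases}
    return {agg_name: {"filters": {"filters": named_filters}}}


def phrase_counts(response, agg_name="my-terms"):
    """Map each filter name to its doc_count in a filters aggregation result."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {name: bucket["doc_count"] for name, bucket in buckets.items()}


phrases = ["green socks", "red socks"]  # in practice, the 50 - 100 phrases
body = {
    "query": {
        "query_string": {
            "fields": ["title", "text"],
            "query": '("green socks" OR "red socks") AND NOT ("yellow" OR "blue")',
        }
    },
    "aggs": build_filters_agg(phrases),
    "size": 100,
}

# Serialize the body exactly as it would be sent to the search endpoint.
print(json.dumps(body, indent=2))
```

For named filters, the result has one bucket per filter name, each carrying a doc_count, which is what phrase_counts reads back into a phrase-to-count mapping.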

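A word of caution: as I mentioned earlier, the field mapping will affect the performance and memory requirements of your aggregation. Also note that a term filter is not analyzed, so a multi-word value like "red socks" only matches a field that stores that exact value as a single token (for example, a not_analyzed field).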

Many thanks. In the end the "filters aggregation" was what was needed. Because this is a reporting tool as opposed to a critical front-end service, "filters aggregation" seems appropriate so far. — Roland Dunn, Apr 27 '15 at 14:28

score 0 · Answer 2 · answered Apr 24 '15 at 08:49

Unless you really have exabytes of data, I recommend working with Lucene instead of ElasticSearch to reduce the overhead. There is no use in serializing data in JSON and sending it over the network when you could access it directly more efficiently...

Unless you want to load 80000 documents, I suggest you send two more requests:

"green socks" AND NOT ("yellow" OR "blue")
"red socks" AND NOT ("yellow" OR "blue")

to get the counts you are interested in.
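
The per-phrase query strings above can be generated rather than typed out. This is a small Python sketch under the assumption that every phrase shares the same NOT clause; per_phrase_queries is a hypothetical helper name, and it only builds the strings.

```python
def per_phrase_queries(phrases, excluded):
    """Return one query_string expression per phrase, keeping the NOT clause."""
    not_clause = " OR ".join('"{}"'.format(word) for word in excluded)
    return ['"{}" AND NOT ({})'.format(phrase, not_clause) for phrase in phrases]


queries = per_phrase_queries(["green socks", "red socks"], ["yellow", "blue"])
for query in queries:
    print(query)
```

Each string would go into the "query" value of a query_string search. Setting "size" to 0 on those requests returns the hit total without loading any documents. With 50 - 100 phrases this does mean 50 - 100 extra requests.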

It is possible to do all three at once - if you dig deep into the Lucene API, instead of going through the text search API. It's all set intersections, nothing spectacular. But again, you don't want to transmit such data over the network without need.

Has QUIT--Anony-Mousse

I disagree with your first comment. ElasticSearch or Solr isn't about size of the data, it's about providing a layer of abstraction over top of Lucene. What you're saying is akin to "no point in writing in C when you could just code it up directly in assembly for better performance" – J. Dimeo Dec 29 '16 at 19:08
You can use Lucene low-level, but it already comes with plenty of "abstraction" (these layers aren't actually abstracting anything, just like C is not an abstraction of assembly). Abstraction would be if you can actually exchange the underlying engine, but Solr and ElasticSearch can't use e.g. Xapian instead of Lucene for all I know. They are primarily a web server API for Lucene; and they make many things _very_ complicated to do that are damn easy in Lucene. – Has QUIT--Anony-Mousse Dec 29 '16 at 22:35
