
This question is very similar to an older one posted here, "Retrieve analyzed tokens from ElasticSearch documents", but I thought it made sense to ask again in case anything has changed in the latest version of ElasticSearch.

We are searching bodies of text in ElasticSearch, with both the search query and the field mapping using the snowball stemmer built into ElasticSearch. The performance and results are great, but because we need the stemmed text body for post-analysis, we would like each document in the search results to come back with the actual stemmed tokens for its text field.

The mapping for the field currently looks like:

      "TitleEnglish": {
        "type": "string",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "string",
            "analyzer": "english"
          },
          "stemming": {
            "type": "string",
            "analyzer": "snowball"
          }
        }
      }

and the search query is performed specifically on TitleEnglish.stemming. Ideally I would like the search to return that field, but requesting it returns the original text rather than the analyzed tokens.

Does anybody know of a way to do this? We have looked at Term Vectors, but they only seem to be retrievable for individual documents or a set of documents, not for a search result.

Or perhaps other solutions like Solr or Sphinx do offer this option?


To add some extra information, if we run the following query:

GET /_analyze?analyzer=snowball&text=Eight issue of Industrial Lorestan eliminate barriers to facilitate the Committees review of

It returns the stemmed words: eight, issu, industri, etc. This is exactly the result we would like back for each matching document for all of the words in the text (so not just the matches).
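
To make the desired output concrete, the stemmed tokens can be pulled out of the `_analyze` JSON response with a few lines of Python. This is only a sketch: `build_analyze_url`, `tokens_from_analyze`, and the host `http://localhost:9200` are names assumed for illustration, and `sample_response` is a hand-written, trimmed stand-in for a response, not real output.

```python
from urllib.parse import urlencode


def build_analyze_url(base_url, analyzer, text):
    """Build the query-string form of the _analyze request (hypothetical helper)."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url}/_analyze?{query}"


def tokens_from_analyze(response):
    """Return the token strings from an _analyze response, ordered by position."""
    tokens = sorted(response.get("tokens", []), key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# Assumed local node; adjust to your cluster.
url = build_analyze_url("http://localhost:9200", "snowball", "Eight issue of Industrial")

# Hand-written sample in the shape of an _analyze response
# (fields trimmed, position values illustrative).
sample_response = {
    "tokens": [
        {"token": "issu", "position": 1},
        {"token": "eight", "position": 0},
        {"token": "industri", "position": 3},
    ]
}

stems = tokens_from_analyze(sample_response)
print(url)
print(stems)
```

What we are after is a list like `stems` attached to every hit in the search response, without a separate request per document.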

luckylwk
  • So, the solutions in the question you linked to didn't work for you? What went wrong with them? – femtoRgon Mar 16 '16 at 15:33
  • The term vector answers are not an actual solution (as described above). – luckylwk Mar 16 '16 at 17:39
  • Yes, I saw that, but it didn't explain anything to me. What is a search result if not a document? – femtoRgon Mar 16 '16 at 17:51
  • In our situation a search result is a body of documents (say: 8000 documents) and we don't want to extract the term-vectors for these documents individually as this would be too intensive for real-time analytics. – luckylwk Mar 16 '16 at 17:57

1 Answer


Unless I'm missing something obvious, why not simply return a terms aggregation on the TitleEnglish.stemming field?

{
    "query": {...},
    "aggs" : {
        "stems" : {
            "terms" : { 
                "field" : "TitleEnglish.stemming",
                "size": 50
            }
        }
    }
}

If you add that aggregation to your query, you'll get a breakdown of all the stemmed terms in the TitleEnglish.stemming sub-field from the documents that matched your query.
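
As a rough sketch of consuming that aggregation on the client side, the buckets can be flattened into a dictionary. The helper name `stem_doc_counts` and the sample response are assumptions made for illustration; only the `aggregations` / `buckets` / `key` / `doc_count` layout comes from the terms aggregation response format.

```python
def stem_doc_counts(response, agg_name="stems"):
    """Map each stemmed term to the number of matched documents containing it."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written sample shaped like a search response that includes the
# "stems" aggregation; the terms and counts are made up.
sample_response = {
    "hits": {"total": 2, "hits": []},
    "aggregations": {
        "stems": {
            "buckets": [
                {"key": "industri", "doc_count": 2},
                {"key": "eight", "doc_count": 1},
                {"key": "issu", "doc_count": 1},
            ]
        }
    },
}

counts = stem_doc_counts(sample_response)
print(counts)
```

Note that `doc_count` is the number of matching documents containing the term, so the breakdown covers the result set as a whole rather than each document separately.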

Val
  • Thanks for the answer Val. I have tried it and it certainly works for what you are describing. Next to the documents it returns a list of all stemmed tokens that are present in the search returns. It is not really the answer we are looking for though as we now still have to parse each document and map it to its stemmed constituents. – luckylwk Mar 24 '16 at 10:46
  • So you need to have the stemmed tokens returned per document? – Val Mar 24 '16 at 10:55
  • Yes, that would be the situation we are after. I'll update my initial question if that was not clear. – luckylwk Mar 24 '16 at 11:44