elasticsearch - Return the tokens of a field

Question

How can I have the tokens of a particular field returned in the result?

For example, a GET request

curl -XGET 'http://localhost:9200/twitter/tweet/1'

returns

{
    "_index" : "twitter",
    "_type" : "tweet",
    "_id" : "1", 
    "_source" : {
        "user" : "kimchy",
        "postDate" : "2009-11-15T14:12:12",
        "message" : "trying out Elastic Search"
    } 
}

I would like the tokens of the '_source.message' field to be included in the result.

score 29 · Accepted Answer · edited Dec 10 '20 at 07:28

Another way to do this is with a script_fields script:

curl -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/test-idx/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "message"
            }
        }

    }
}'

Note that although this script returns the actual terms that were indexed, it also loads and caches all values of the field, which can use a lot of memory on large indices. In that case it may be more useful to retrieve the field value from stored fields or from the source and re-analyze it on the fly using the following MVEL script:

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.StringReader;

// Cache analyzer for further use
cachedAnalyzer=(isdef cachedAnalyzer)?cachedAnalyzer:doc.mapperService().documentMapper(doc._type.value).mappers().indexAnalyzer();

terms=[];
// Get value from Fields Lookup
//val=_fields[field].values;

// Get value from Source Lookup
val=_source[field];

if(val != null) {
  tokenStream=cachedAnalyzer.tokenStream(field, new StringReader(val));
  CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute);
  // Lucene requires reset() before the first incrementToken() call
  tokenStream.reset();
  while(tokenStream.incrementToken()) {
    terms.add(termAttribute.toString());
  };
  tokenStream.close();
}
terms

This MVEL script can be stored as config/scripts/analyze.mvel and used with the following query:

curl 'http://localhost:9200/test-idx/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "analyze",
            "params": {
                "field": "message"
            }
        }
    
    }
}'
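On the client side, the values of a script field normally come back under each hit's "fields" object. The following Python sketch shows one way to collect them per document. The function name terms_by_doc and the sample response are hypothetical, written by hand for illustration, and the exact response shape may differ between Elasticsearch versions.

```python
from typing import Any, Dict, List


def terms_by_doc(search_response: Dict[str, Any],
                 script_field: str = "terms") -> Dict[str, List[Any]]:
    """Map each hit's _id to the values returned for the named script field."""
    result: Dict[str, List[Any]] = {}
    for hit in search_response.get("hits", {}).get("hits", []):
        values = hit.get("fields", {}).get(script_field, [])
        # Assumption: some versions may return a bare value instead of a list.
        if not isinstance(values, list):
            values = [values]
        result[hit["_id"]] = values
    return result


# Hand-written sample shaped like a search response (not real server output).
sample_search_response = {
    "hits": {
        "hits": [
            {"_id": "1", "fields": {"terms": ["trying", "out", "elastic", "search"]}},
            {"_id": "2", "fields": {}},
        ]
    }
}

print(terms_by_doc(sample_search_response))
```

A hit without the script field yields an empty list rather than raising, which keeps the result usable when only some documents have a value for the field.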

It is scary, but fun. :) I wish it was possible to access IndexReader in DocLookup (it's there, but it's private at the moment). Then it would have been possible to use term vectors instead of re-analyzing the text the second time. — imotov, Nov 01 '12 at 18:11
Yeah, sure. Wouldn't it be nice also to read the term vectors without scripts, maybe through a plugin? — javanna, Nov 01 '12 at 18:33
@imotov is there any way of getting the correct analyzer and field type for a field automatically from this script? I'd like to use this functionality in https://metacpan.org/module/Elastic::Model::Role::Doc#terms_indexed_for_field- — DrTech, Nov 02 '12 at 12:37
Very good answer, thanks. Maybe it'd be useful to pluginize it? — Igor Kupczyński, Nov 02 '14 at 12:04
Please add tokenStream.reset(); before while cycle (otherwise it fails on my setup). — usamec, Apr 08 '15 at 09:26
I don't think this is possible to achieve without writing a plugin in elasticsearch 5.0 and above. — imotov, Nov 30 '16 at 22:58

score 7 · Answer 2 · edited Apr 29 '21 at 06:39

If you mean the tokens that have been indexed, you can run a terms facet on the message field. Increase the size value to get more entries back, or set it to 0 to get all terms.

Lucene can store term vectors, but as far as I know there is currently no way to access them through elasticsearch.

Why do you need that? If you only want to check what is being indexed, have a look at the Analyze API.
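As a rough illustration of the Analyze API route, the Python sketch below builds an _analyze request that uses the analyzer mapped to a field, and reads the token strings out of a response. It assumes a recent Elasticsearch version where the request takes a JSON body with "field" and "text" (older versions accepted the text differently). The helper names and the sample response are hypothetical and written by hand; nothing is sent over the network here.

```python
import json
from typing import Any, Dict, List, Tuple


def build_analyze_request(base_url: str, index: str,
                          field: str, text: str) -> Tuple[str, str]:
    """Return (url, json_body) for an _analyze call using the field's analyzer."""
    url = f"{base_url.rstrip('/')}/{index}/_analyze"
    body = json.dumps({"field": field, "text": text})
    return url, body


def tokens_from_analyze(response: Dict[str, Any]) -> List[str]:
    """Return the token strings of an _analyze response, ordered by position."""
    tokens = sorted(response.get("tokens", []), key=lambda t: t["position"])
    return [t["token"] for t in tokens]


url, body = build_analyze_request(
    "http://localhost:9200", "twitter", "message", "trying out Elastic Search"
)
print(url)
print(body)

# Hand-written sample shaped like an _analyze response (not real server output).
sample_analyze_response = {
    "tokens": [
        {"token": "out", "start_offset": 7, "end_offset": 10, "position": 1},
        {"token": "trying", "start_offset": 0, "end_offset": 6, "position": 0},
    ]
}
print(tokens_from_analyze(sample_analyze_response))
```

Note that this analyzes the text you pass in, so it shows what would be indexed for that text rather than reading tokens back from a stored document.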

score 1 · Answer 3 · answered Sep 13 '22 at 16:05

Nowadays, it's possible with the Term vectors API:

curl 'http://localhost:9200/twitter/_termvectors/1?fields=message'

Result:

{
  "_index": "twitter",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "message": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        "elastic": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 18
            }
          ]
        },
        "out": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 7,
              "end_offset": 10
            }
          ]
        },
        "search": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 19,
              "end_offset": 25
            }
          ]
        },
        "trying": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 6
            }
          ]
        }
      }
    }
  }
}

Note: Mapping types (here: tweet) have been removed in Elasticsearch 8.x (see migration guide).
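The response lists terms alphabetically, with each occurrence carrying its position. If a flat token list in document order is what you need, a small client-side step can rebuild it. The Python sketch below is one way to do that; the function name tokens_in_order is hypothetical, and the sample is a trimmed copy of the result shown above.

```python
from typing import Any, Dict, List, Tuple


def tokens_in_order(termvectors_response: Dict[str, Any], field: str) -> List[str]:
    """Flatten a _termvectors response into tokens sorted by position."""
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    positioned: List[Tuple[int, str]] = []
    for term, info in terms.items():
        # A term that occurs several times has one entry per occurrence.
        for occurrence in info.get("tokens", []):
            positioned.append((occurrence["position"], term))
    return [term for _, term in sorted(positioned)]


# Trimmed copy of the term vectors result shown above.
sample_termvectors_response = {
    "term_vectors": {
        "message": {
            "terms": {
                "elastic": {"term_freq": 1, "tokens": [{"position": 2}]},
                "out": {"term_freq": 1, "tokens": [{"position": 1}]},
                "search": {"term_freq": 1, "tokens": [{"position": 3}]},
                "trying": {"term_freq": 1, "tokens": [{"position": 0}]},
            }
        }
    }
}

print(tokens_in_order(sample_termvectors_response, "message"))
# prints ['trying', 'out', 'elastic', 'search']
```

This relies on positions being present in the response; if they are not returned for a field, the function gives an empty list.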
