40

I'm trying to access the analyzed/tokenized text in my ElasticSearch documents.

I know you can use the Analyze API to analyze arbitrary text according to your analysis modules, so I could copy and paste data from my documents into the Analyze API to see how it was tokenized.

This seems unnecessarily time-consuming, though. Is there any way to instruct ElasticSearch to return the tokenized text in search results? I've looked through the docs and haven't found anything.

Brian Webster
Clay Wardell
  • I tried both the Analyze API and term vectors, and found term vectors more complicated, because parsing their result takes more time than parsing the Analyze API result. Have you gained any more insight since you raised this question? – Qiulang Oct 22 '18 at 07:06
  • Related question [here](https://stackoverflow.com/q/43415139/13762264), I found `docvalue_fields` worked for me – pjpscriv Feb 22 '23 at 23:28

3 Answers

17

This question is a little old, but I think an additional answer is warranted.

ElasticSearch 1.0.0 added the Term Vector API, which gives you direct access to the tokens ElasticSearch stores under the hood on a per-document basis. The API docs are not very clear on this (it is only mentioned in the example), but in order to use the API you first have to indicate in your mapping definition that you want to store term vectors, by setting the `term_vector` property on each field.
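A minimal sketch of what working with the result might look like, in Python. The Term Vector API response nests each term under `term_vectors.<field>.terms`, and each term carries a `tokens` list with positions and offsets when the mapping stores them. The field name `body`, the mapping fragment, and the sample response below are made up for illustration, and the field `type` in particular depends on your ElasticSearch version.

```python
# Hypothetical mapping fragment: store term vectors (with positions and
# offsets) for a field called "body". The "type" value is an assumption and
# varies between ElasticSearch versions.
EXAMPLE_MAPPING = {
    "properties": {
        "body": {"type": "text", "term_vector": "with_positions_offsets"}
    }
}

# Hand-written sample shaped like a Term Vector API response for a document
# whose "body" is "The quick brown fox" (standard analyzer assumed).
SAMPLE_RESPONSE = {
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "brown": {"term_freq": 1, "tokens": [
                    {"position": 2, "start_offset": 10, "end_offset": 15}]},
                "fox": {"term_freq": 1, "tokens": [
                    {"position": 3, "start_offset": 16, "end_offset": 19}]},
                "quick": {"term_freq": 1, "tokens": [
                    {"position": 1, "start_offset": 4, "end_offset": 9}]},
                "the": {"term_freq": 1, "tokens": [
                    {"position": 0, "start_offset": 0, "end_offset": 3}]},
            }
        }
    },
}


def tokens_in_order(response, field):
    """Rebuild a field's token stream, sorted by position, from a response dict."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    positioned = []
    for term, info in terms.items():
        # A term that occurs several times has one entry per occurrence.
        for token in info.get("tokens", []):
            positioned.append((token["position"], term))
    return [term for _, term in sorted(positioned)]


if __name__ == "__main__":
    print(tokens_in_order(SAMPLE_RESPONSE, "body"))
```

The response lists terms alphabetically rather than in document order, which is why the helper sorts by `position`. If the field does not store positions, the `tokens` lists are absent and the helper returns an empty list.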

mike rodent
Torsten Engelbrecht
  • In my tests I found that analyzing the term vector result was more time-consuming than just checking the Analyze API result. – Qiulang Oct 22 '18 at 14:19
16

Have a look at this other answer: elasticsearch - Return the tokens of a field. Unfortunately, it requires reanalyzing the content of your field on the fly using the script provided.
It should be possible to write a plugin to expose this feature. The idea would be to add two endpoints that:

  • read the Lucene TermsEnum, like the Solr TermsComponent does, which is also useful for auto-suggestions. Note that this wouldn't be per document, just every term in the index with its term frequency and document frequency (potentially expensive with a lot of unique terms)
  • read the term vectors if enabled, like the Solr TermVectorComponent does. This would be per document but requires storing the term vectors (you can configure this in your mapping), and it also allows retrieving positions and offsets if enabled.
javanna
  • I also had a need for the first case there, but I don't care about the frequencies, I just want a list -- so my plan is just to iterate through the field data cache (like the regular term facet does) but without gathering the counts. I have a partially-written plugin for it. – Andrew Clegg Nov 15 '12 at 23:36
  • Nice work, it would be nice if you can share it on github! :) – javanna Nov 16 '12 at 10:03
  • if you mean me -- yes I will when it's done, here: https://github.com/ptdavteam/elasticsearch-approx-plugin – Andrew Clegg Nov 28 '12 at 11:10
  • We have now added a TermList facet to that plugin. It's still a bit experimental. I'd be interested in any feedback if you have a chance to try it out. – Andrew Clegg Jan 17 '13 at 11:40
  • Nice, I'll have a look at it. On the other hand, I haven't had the chance to even start working on the ideas I had...too bad! – javanna Jan 17 '13 at 18:16
  • @javanna, any progress on this plugin? – Rafid May 07 '16 at 11:19
  • @Rafid sorry, this one ended up in the list of "never released" projects of mine. Will update the answer. – javanna May 09 '16 at 12:27
5

You may want to use scripting; however, scripting must be enabled on your server.

curl 'http://localhost:9200/your_index/your_type/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "field_x.field_y"
            }
        }
    }
}'

Whether scripting is allowed by default depends on the ElasticSearch version, so please check the official documentation.