
I have built an Elasticsearch query with a filter, and in the filter context I am writing a Painless script that filters documents based on the body of a text field. However, when I access the text field, I get a list of terms instead of the original text. I am looking for a way to access the original text body in the Painless script. Alternatively, if the body of the text is not accessible, I would like to access the document's term frequency vector in this context.

For instance, if I run this query:

GET twitter/_search
{
  "query": {
    "bool": {
      "must": {
        "term": { "body": "spark" }
      },
      "filter": [
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": """
                          String text = doc['body'].toString();
                          Debug.explain(text);
                          return true;
              """
            }
          }
        }
      ]
    }
  }
}

I get this response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 4,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 2,
        "index" : "twitter",
        "node" : "AClIunrSRUKb1gbhBz-JoQ",
        "reason" : {
          "type" : "script_exception",
          "reason" : "runtime error",
          "painless_class" : "java.lang.String",
          "to_string" : "[and, by, cutting, doug, hadoop, jack, jim, lucene, made, spark, the, was]",
          "java_class" : "java.lang.String",
          "script_stack" : [
            "Debug.explain(text);\n                         ",
            "              ^---- HERE"
          ],
          "script" : """
                          String text = doc['body'].toString();
                          Debug.explain(text);
                         return true;
                      """,
          "lang" : "painless",
          "caused_by" : {
            "type" : "painless_explain_error",
            "reason" : null
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, the debug output shows that doc['body'].toString() is in fact a list of terms: [and, by, cutting, doug, hadoop, jack, jim, lucene, made, spark, the, was]. What I would like is access to the original text, which in this example is "body" : "The Lucene was made by Doug Cutting and the hadoop was made by Jim and Spark was made by jack"

NOTE: I have set "fielddata": true and "store": true on this field, and I have also indexed the document in a body.exact field so that the terms won't get analyzed. Nevertheless, I can't access the original text in the script in the filter context; I always get the list of unique terms.
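To illustrate why the output has this shape, here is a rough Python sketch. It assumes that fielddata on an analyzed text field exposes the document's tokens deduplicated and sorted, and it uses a simplified stand-in for the analyzer (lowercase, split on non-alphanumeric characters); the `analyze` helper is hypothetical and not part of Elasticsearch. It also shows the per-document term frequencies I would ideally like to reach from the script.

```python
import re
from collections import Counter


def analyze(text):
    """Hypothetical, simplified stand-in for the analyzer:
    lowercase the text and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


body = ("The Lucene was made by Doug Cutting and the hadoop was made by Jim "
        "and Spark was made by jack")

tokens = analyze(body)

# Assumption: the script sees each distinct term once, in sorted order,
# so word order and repetition from the original text are lost.
doc_terms = sorted(set(tokens))

# The term frequency vector keeps the counts but still not the word order.
term_freqs = Counter(tokens)

print(doc_terms)
print(term_freqs["made"])
```

Under these assumptions, `doc_terms` reproduces the list from the debug output above, which is why the original sentence cannot be rebuilt from it.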

Many thanks for your help!

Rouzbeh
  • [stored fields are always stored as arrays](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html). If you want to access the original document text, you can [add `_source = true`](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-source-filtering.html) (if it's available) to your `GET` – Nate Apr 23 '20 at 17:14
  • Thanks Nate, I added `"_source":true` to the query but don't know how to access it in the Painless script. When I use `_source` in the filter context, it says `Variable [_source] is not defined.` – Rouzbeh Apr 23 '20 at 20:09

2 Answers

You can use the keyword datatype as a sub-field of body, which is not analyzed and so keeps the original string:

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

GET twitter/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "body": "spark"
        }
      },
      "filter": [
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": """
                          String text = doc['body.keyword'].toString();
                          Debug.explain(text);
                         return true;
"""
            }
          }
        }
      ]
    }
  }
}

yielding

          "painless_class" : "java.lang.String",
          "to_string" : "[The Lucene was made by Doug Cutting and the hadoop was made by Jim and Spark was made by jack]",
          "java_class" : "java.lang.String",
          "script_stack" : [
            "Debug.explain(text);\n                         ",
            "              ^---- HERE"
          ],
          "script" : """
                          String text = doc['body.keyword'].toString();
                          Debug.explain(text);
                         return true;
""",
Joe - GMapsBook.com

One solution I have found so far is to use multi-fields: add a sub-field, for example body.raw, that is indexed as keyword. Calling doc['body.raw'].value.toString(); then returns the original text. I would still like to find a solution where I don't have to index two fields and can get the original text from _source or something similar.
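As a minimal sketch of that multi-field approach, the Python snippet below only builds the two request bodies as dictionaries and prints the search body as JSON; it does not call any Elasticsearch client. The `_doc` mapping type follows the other answer, and the `phrase` parameter and the `contains` check are illustrative assumptions, not something from my actual query.

```python
import json

# Mapping body: "body" stays an analyzed text field, and the "raw"
# sub-field keeps the whole string as a single keyword value.
mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "body": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                }
            }
        }
    }
}

# Search body: the script filter reads the unanalyzed value from
# doc['body.raw'] and checks it for a phrase ("phrase" is a
# hypothetical parameter used only for illustration).
search = {
    "query": {
        "bool": {
            "must": {"term": {"body": "spark"}},
            "filter": [
                {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "source": (
                                "doc['body.raw'].size() != 0 && "
                                "doc['body.raw'].value.contains(params.phrase)"
                            ),
                            "params": {"phrase": "Doug Cutting"},
                        }
                    }
                }
            ],
        }
    }
}

print(json.dumps(search, indent=2))
```

The printed JSON could be pasted after `GET twitter/_search`, and `mapping` after `PUT twitter`. The `size() != 0` guard is there so that documents without a value in body.raw are skipped instead of raising a script error.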

Rouzbeh