I have a string field "myfield.keyword", where entries have the following format:

AAA_BBBB_CC

DDD_EEE_F

I am trying to create a scripted field that outputs the substring before the first _, a scripted field that outputs the substring between the first and second _ and a scripted field that outputs the substring after the second _.

I was trying to use .split('_') to do this, but found that this method is not available in Painless:

def newfield = "";
def path = doc['myfield.keyword'].value;
if (...)
{newfield = path.split('_')[1];} else {newfield="null";}
return newfield;

I then tried the workaround suggested here, but found that I must enable regexes in Elastic (which would not be possible in my case):

def newfield = "";
def path = doc['myfield.keyword'].value;
if (...)
{newfield = /_/.split(path)[1];} else {newfield="null";}
return newfield;

Is there a way to do this that does not presuppose enabling regexes?

EDIT:

Thank you for such an elegant solution, Val. This answers the question I asked. My question, however, was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like:

AAA_BB_CCC_DD_E

FFF_GGG_HH_JJJJ_KK

So, if I understand correctly, indexOf() and lastIndexOf() alone cannot give me BB, CCC or DD. I thought that I could adapt your solution and find the index of the second and third occurrences of _ by using string.indexOf("_", 1) and string.indexOf("_", 2). However, I always get the same result as string.indexOf("_") without any extra parameters (i.e. the result is always the index of the first occurrence of _).

gbs

2 Answers

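Enabling regular expressions is not terribly complicated (it means setting script.painless.regex.enabled: true in elasticsearch.yml), but because that is a static setting it requires restarting the nodes of your cluster, and that might not be easy for you depending on the environment.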

Another way to achieve this is to do it the "old way", using only indexOf() and substring(). First you create a reusable script for the script fields. The script walks through the string from one _ to the next and collects the pieces in a list, so it works for any number of _ symbols. Note that the second argument of indexOf() is the position at which the search starts, not the number of the occurrence to find. The script takes as input the name of the field to split and the zero-based index of the substring to return:

POST _scripts/my-split
{
  "script": {
    "lang": "painless",
    "source": """
      def str = doc[params.field].value;
      def parts = [];
      int start = 0;
      // indexOf(token, from) starts searching at position "from"
      int end = str.indexOf("_", start);
      while (end >= 0) {
        parts.add(str.substring(start, end));
        start = end + 1;
        end = str.indexOf("_", start);
      }
      // whatever follows the last _ is the final part
      parts.add(str.substring(start));
      return parts[params.index];
    """
  }
}
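
As a side note on the indexOf("_", 1) and indexOf("_", 2) attempts mentioned in the question: the second argument is the position from which the search starts, not the occurrence number, so both calls still find the first _ as long as it sits at or after that position. Painless exposes the Java String methods, so the behaviour can be checked with a small plain-Java sketch. The class name IndexOfDemo and the helper nthIndexOf are made up for illustration and are not part of Painless or Elasticsearch:

```java
class IndexOfDemo {
    // Hypothetical helper: position of the n-th (1-based) occurrence of
    // token in s, or -1 if there are fewer than n occurrences.
    static int nthIndexOf(String s, String token, int n) {
        int pos = -1;
        for (int i = 0; i < n; i++) {
            // Resume the search just after the previous match.
            pos = s.indexOf(token, pos + 1);
            if (pos < 0) {
                return -1;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "AAA_BB_CCC_DD_E";
        System.out.println(s.indexOf("_"));        // 3
        System.out.println(s.indexOf("_", 1));     // 3: search starts at position 1
        System.out.println(s.indexOf("_", 2));     // 3: search starts at position 2
        System.out.println(s.indexOf("_", 4));     // 6: search starts after the first _
        System.out.println(nthIndexOf(s, "_", 3)); // 10
    }
}
```

In other words, to reach the next _ you pass the previous match position plus one as the second argument, rather than 1, 2, 3.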

Then you can simply define one script field for each of the parts like this:

POST test/_search
{
  "script_fields": {
    "first": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 0
        }
      }
    },
    "second": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 1
        }
      }
    },
    "third": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 2
        }
      }
    }
  }
}

The response you get will look like this:

  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "ykS-l3UBeO1HTBdDvTZd",
    "_score" : 1.0,
    "fields" : {
      "first" : [
        "AAA"
      ],
      "second" : [
        "BBBB"
      ],
      "third" : [
        "CC"
      ]
    }
  }
Val
  • Thank you for such an elegant solution Val. This answers the question I asked. My question, however, was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like: and so indexOf() and last – gbs Nov 06 '20 at 12:42
  • I've updated my script which now works with up to four occurrences of `_`. Let me know how it works for you. – Val Nov 09 '20 at 06:06

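You could use str.splitOnToken("_"), which splits on a literal token (so regexes do not need to be enabled) and returns the pieces as an array that you can index or loop over. For the field in the question, doc['myfield.keyword'].value.splitOnToken("_")[1] would give the second part. You can also split on longer tokens, such as: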

def message = "[LOG] Something something WARNING: Your warning";
def reason = message.splitOnToken("WARNING: ")[1];

So reason will hold the part of the string after the token: Your warning.

George Ts.