I have a fairly large document that I'm ingesting into Elasticsearch (70-80 attributes), and new fields can come in on a regular basis. I can use the rename processor to rename each field individually, but I was wondering whether there is a more efficient way to dynamically rename every field using a script.

The current field names are either all lowercase separated by underscores, or a mix of upper and lower case separated by underscores. I need to rename each field as follows:

(eg source field) TV_field_1 --> (needs to be renamed to) tvField1

(eg source field) feature_title_no --> (needs to be renamed to) featureTitleNo

Thanks

raja

1 Answer

Since the rename processor doesn't support pattern-based renaming, I have written a custom script processor:

PUT _ingest/pipeline/snakeCaseToCamelCase
{
  "description": "Convert snake case to camel case",
  "processors": [
    {
      "script": {
        "source": """
            // Iterate through all keys and create list of fields with snake case
            def loopAllFields(def x){
              def ret=[];
              if(x instanceof Map){
                for (entry in x.entrySet()) {
                  if (entry.getKey().indexOf("_")==0) { 
                    continue;
                  }
                  // Get value
                  def val=entry.getValue();
                  if(entry.getKey().indexOf("_")>-1)
                  {
                      ret.add(entry.getKey());
                  }
                  // If further object
                  if(val instanceof Map|| val instanceof HashMap)
                  {
                    def list =loopAllFields(val);
                    for(item in list)
                    {
                      ret.add(entry.getKey()+"."+ item);
                    }
                  }
                  // If array
                  if(val instanceof ArrayList)
                  {
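                    for(v in val)
                    {
                       def list =loopAllFields(v);
                       for(item in list)
                       {
                          // Array elements usually share the same keys;
                          // record each path once so it is renamed only once
                          def path=entry.getKey()+"."+ item;
                          if(!ret.contains(path))
                          {
                            ret.add(path);
                          }
                       }
                    }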
                  }
                }
              }
              return ret;
            }

            // Create a camel case field and delete snake case field
            def renameField(def ctx, def fieldName)
            {
               def str=splitText(fieldName,'.');
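               if(str.length<=1)
               {
                  // Rename only when the key is present; copying a missing
                  // key would store null under the new name
                  if(ctx.containsKey(fieldName))
                  {
                    def newField=snakeToCamel(fieldName);
                    ctx[newField]=ctx[fieldName];
                    ctx.remove(fieldName);
                  }
               }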
               else
               {
                   if(ctx[str[0]] instanceof ArrayList)
                   {
                       def fld=combineArray(str);
                       for(v in ctx[str[0]])
                       {
                          renameField(v,fld);
                       }
                   }
                   if(ctx[str[0]] instanceof Map)
                   {
                      def fld=combineArray(str);
                      renameField(ctx[str[0]],fld); 
                    }
               }
               return 1;
            }

            def combineArray(def str)
            {
              def fld="";
              for(int i=1;i<str.length;i++)
              {
                if(fld=="")
                {
                   fld+=str[i];
                }
                else
                {
                   fld+="."+str[i];
                }
              }
              return fld;
            }

            // Convert field name from snake case to camel case
            def snakeToCamel(def s){
              //def str=/_/.split(s);
              def str=splitText(s,'_');
              def fieldName="";
              for(int i=0;i<str.length;i++)
              {
                 if(i==0)
                 {
                   fieldName+=str[i].toLowerCase();
                 }
                 else{
                    if(str[i].length()==1)
                    {
                      fieldName+=str[i].substring(0,1).toUpperCase();
                    }else{
                      fieldName+=str[i].substring(0,1).toUpperCase()+str[i].substring(1,str[i].length());
                    }
                 }
              }
              return fieldName;
            }

            // Split a string on a separator without using regex.
            // Empty segments (leading, trailing or doubled separators) are skipped.
            def splitText(def str, def seperator)
            {
              def ret=[];
              def pos= str.indexOf(seperator);
              while(pos>-1)
              {
                 if(pos>0)
                 {
                   ret.add(str.substring(0,pos));
                 }
                 str=str.substring(pos+1,str.length());
                 pos=str.indexOf(seperator);
              }
              if(str.length()>0)
              {
                ret.add(str);
              }
              return ret;
            }

            def fields=loopAllFields(ctx);
            fields.sort((s1, s2) -> s2.length() - s1.length());
            for(field in fields)
            {
              renameField(ctx,field);
            }

"""
      }
    }
  ]
}
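
To try the pipeline without indexing anything, the simulate API can run it against inline documents. This is a minimal sketch, assuming the pipeline above has been stored under the name `snakeCaseToCamelCase`; the field values are made up for illustration.

```
POST _ingest/pipeline/snakeCaseToCamelCase/_simulate
{
  "docs": [
    {
      "_source": {
        "TV_field_1": "some value",
        "feature_title_no": 42
      }
    }
  ]
}
```

If the script behaves as intended, the returned document should contain `tvField1` and `featureTitleNo` in place of the original keys.
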
jaspreet chahal
  • Thanks @jaspreet. I'm running into one issue here though: since my ES domain is running on AWS, I don't have access to the elasticsearch.yml file, and as a result regex in scripts is disabled. This causes a compile error for the above script because of the regex used in this line ```def str=/\\./.split(fieldName)``` – raja May 07 '20 at 06:18
  • @raja def splitText(def str) { def pos= str.indexOf('.'); def ret=[]; while(pos>0) { def split=str.substring(0,pos); def rest= str.substring(pos+1,str.length()); ret.add(split); pos=rest.indexOf('.'); if(pos==-1) { ret.add(rest); } } return ret; } call instead of regex – jaspreet chahal May 07 '20 at 06:43
  • Thanks @jaspreet, did that and the script compiled but get the following error at runtime - ```"ScriptException[runtime error]; nested: PainlessError[The maximum number of statements that can be executed in a loop has been reached.];``` – raja May 07 '20 at 21:01
  • ```"script_stack": [ "while(pos>0) \n { \n def ", "^---- HERE" ],``` – raja May 07 '20 at 21:02
  • Substituted two statements in the original script at the following places to call the above ```splitText``` function ```def renameField(def ctx, def fieldName) { def str=splitText(fieldName); ....``` and ```def snakeToCamel(def s){ def str=splitText(s); ....``` – raja May 07 '20 at 21:04
  • @raja can you add a sample doc for which the error occurs? – jaspreet chahal May 08 '20 at 08:41
  • The doc is quite large @jaspreet so there's no way for me to share that unfortunately – raja May 12 '20 at 00:54
  • { "master_no": { "master_no": 111111111, "barcode": "EEEEEEEEEE" }, "master_desc": "test Custom Patch - test", "barcode": "EEEEEEEEEE", "master_status": "I", "length": "0:00", "lib_element": [ { "master_no": 1100000, "element_master_no": { "element_master_no": 111111111 }, "lib_element_desc": "NO FOOTAGE test", "element_barcode": "222", "element_master_status": "I", "element_type": "O" } ] } – raja May 12 '20 at 01:04
  • was able to trim down the json to a small document that's a subset and re-ran the test. still getting the error for the above json doc @jaspreet – raja May 12 '20 at 01:05
  • uncovered a bug in the script where the values for nested objects aren't being retained and the structures of objects within the nested object are being manipulated. Here is the source doc - `{ "master_no": { "master_no": 18460001 }, "lib_master_audio": [ { "master_no": 18460001, "audio_channel_no": { "audio_channel_no": 10, "audio_channel": "1" } }, { "master_no": 18460001, "audio_channel_no": { "audio_channel_no": 11, "audio_channel": "2" } } ] }` – raja May 18 '20 at 23:37
  • here is the destination doc `{ "libMasterAudio" : [ { "audioChannelNo" : null, "masterNo" : null }, { "audioChannelNo" : null, "masterNo" : null } ], "masterNo" : { "masterNo" : 18460001 } }` – raja May 18 '20 at 23:39
  • Hi @jaspreet, I was able to narrow the issue down to the following: the script works if there is an array of one object, but not for arrays of multiple objects. The current behavior with multiple objects is to return each object with the keys renamed but with null values. I modified the `loopAllFields` function to keep track of the index position of objects in an array; the output no longer displayed null values, but now the fields weren't being renamed. I then looked at the `renameField` function but couldn't fix that. Any help you can provide would be appreciated – raja May 19 '20 at 10:02