
I have a use case where an API I'm calling to retrieve data to put into Elasticsearch is returning nulls.

I need to write an ingest pipeline that uses processors to remove all null fields before the document is written into Elasticsearch. The processors may or may not use Painless scripting.

Here is a sample payload that I currently get from the API:

{
    "master_desc": "TESTING PART",
    "date_added": "2019-10-24T09:30:03",
    "master_no": {
      "master_no": 18460110,
      "barcode": "NLSKYTEST1-1",
      "external_key": null,
      "umid": null
    }
  }

The pipeline should ideally insert the document as -

{
    "master_desc": "TESTING PART",
    "date_added": "2019-10-24T09:30:03",
    "master_no": {
      "master_no": 18460110,
      "barcode": "NLSKYTEST1-1"
    }
}

Note: the fields are dynamic, so I can't write a processor that checks for nulls against a defined set of fields.

Thanks!

raja
  • https://stackoverflow.com/questions/60538475/how-to-handle-nulls-in-an-elasticsearch-index this might help you – Amit Apr 30 '20 at 03:36

2 Answers


Null fields are neither indexed nor searchable. I have written the pipeline below to remove such fields; please test it against all of your scenarios before using it. After indexing documents through this pipeline, you won't be able to find the removed null fields with an "exists" query.

Pipeline:

PUT _ingest/pipeline/remove_null_fields
{
  "description": "Remove any null field",
  "processors": [
    {
      "script": {
        "source": """
          // return the list of field paths whose values are null
          def loopAllFields(def x){
            def ret=[];
            if(x instanceof Map){
              for (entry in x.entrySet()) {
                if (entry.getKey().indexOf("_")==0) { 
                  continue;
                }
                def val=entry.getValue();
                if( val instanceof HashMap ||
                    val instanceof Map || 
                    val instanceof ArrayList)
                {
                  def list=[];
                  if(val instanceof ArrayList)
                  {
                    def index=0;
                    // Call for each object in arraylist
                    for(v in val)
                    {
                      list=loopAllFields(v);
                      for(item in list)
                      {
                          ret.add(entry.getKey()+"["+index+"]."+ item);
                      }
                      index++;
                    }
                  }
                  else
                  {
                    list =loopAllFields(val);
                  }
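                  // if every entry inside this container came back null, mark the
                  // container key itself; otherwise mark each null path individually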
                  if(list.size()==val.size())
                  {
                    ret.add(entry.getKey());
                  }
                  else{
                        for(item in list)
                        {
                          ret.add(entry.getKey()+"."+ item);
                        }
                  }
                }

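                // a plain null value: mark this key for removal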
                if(val==null)
                {
                  ret.add(entry.getKey());
                }
              }
            }
            return ret;
          }


        /* remove a field from the source; recursively walks dotted paths so fields nested inside other fields can be deleted */
        def removeField(def ctx, def fieldname)
        {
            def pos=fieldname.indexOf(".");
            if(pos>0)
            {
              def str=fieldname.substring(0,pos);
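              // a segment like "name[2]" points into an array: recurse into that element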
              if(str.indexOf('[')>0 && str.indexOf(']')>0)
              {
                def s=str.substring(0,str.indexOf('['));
                def i=str.substring(str.indexOf('[')+1,str.length()-1);
                removeField(ctx[s][Integer.parseInt(i)],fieldname.substring(pos+1,fieldname.length())); 
              }
              else
              {
                  if(ctx[str] instanceof Map)
                  {
                    removeField(ctx[str],fieldname.substring(pos+1,fieldname.length()));
                  }
              }
            }else{

              ctx.remove(fieldname);
            } 
            return ctx;
        }

          def list=[];
          list=loopAllFields(ctx);
          for(item in list)
          {
             removeField(ctx,item);   
          }
"""
      }
    }
  ]
}

Post Document:

POST index8/_doc?pipeline=remove_null_fields
{
    "master_desc": "TESTING PART",
    "ddd":null,
    "date_added": "2019-10-24T09:30:03",
    "master_no": {
      "master_no": 18460110,
      "barcode": "NLSKYTEST1-1",
      "external_key": null,
      "umid": null
    }
}

Result:

"hits" : [
      {
        "_index" : "index8",
        "_type" : "_doc",
        "_id" : "06XAyXEBAWHHnYGOSa_M",
        "_score" : 1.0,
        "_source" : {
          "date_added" : "2019-10-24T09:30:03",
          "master_no" : {
            "master_no" : 18460110,
            "barcode" : "NLSKYTEST1-1"
          },
          "master_desc" : "TESTING PART"
        }
      }
    ]
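
To illustrate the "exists" note above: once the pipeline has stripped a null field such as master_no.external_key, an exists query on that field (shown here against the test index used above) should no longer return the document:

GET index8/_search
{
  "query": {
    "exists": {
      "field": "master_no.external_key"
    }
  }
}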
jaspreet chahal
  • Thanks @jaspreet , is the above script written in painless? Just want to confirm cause elastic 6.0 and above only supports painless and i'm not that familiar with it – raja Apr 30 '20 at 08:10
  • @raja it is written in painless. Please test before executing on actual data. I only checked few scenarios – jaspreet chahal Apr 30 '20 at 08:16
  • so the script almost worked. It didn't however eliminate empty objects, empty arrays or empty values. Have posted a more detailed explanation of what i'm seeing above – raja Apr 30 '20 at 09:38
  • the script looks the same, are you sure you updated it? – raja Apr 30 '20 at 18:59
  • @raja Yes. Just needed a small change. I have added these lines if(list.size()==val.size()) { ret.add(entry.getKey()); } else{ for(item in list) { ret.add(entry.getKey()+"."+ item); } – jaspreet chahal May 01 '20 at 00:34
  • Cool, thanks. Will give it a shot later tonight and let you know. Thanks for your help – raja May 01 '20 at 01:32
  • sorry for the delayed response but i got busy with a few other tasks. I tested the script and it eliminated empty arrays but it doesn't remove keys that have null values from nested objects – raja May 05 '20 at 07:37
  • - for example, the doc {"master_no": { "master_no": 18460001 }, "lib_master_audio": [ { "master_no": 1000000, "audio_channel_no": { "audio_channel_no": 10, "audio_channel": "1", "external_key": null } }, { "master_no": 1000001, "audio_channel_no": { "audio_channel_no": 10, "audio_channel": "1", "external_key": null } } ]} still inserts the doc will the keys that have null values for the lib_master_audio objects – raja May 05 '20 at 07:39
  • @raja I have updated this. You will need to clean it up a bit. – jaspreet chahal May 05 '20 at 11:19
  • hey @jaspreet, as i've been testing more, i've found another issue - here is the source ` {"master_no": "1111","lib_master_audio":[ { "master_no": 1111, "audio_desc": "LEFT", "audio_content_desc": "COMPOSITE AUD", "audio_desc2": null, "track_language": "PARISIAN-FRENCH", "audio_note": null}, { "master_no": 1111, "audio_desc": "CENTER", "audio_content_desc": "COMPOSITE AUD", "audio_desc2": null, "track_language": "PARISIAN-FRENCH", "audio_note": null} ]}` – raja May 14 '20 at 08:33
  • there's a lot more key/value pairs in this JSON but while all of them render correctly, this lib_master_audio object (which is an array of objects), returns a bunch of objects with keys that all have null values. Here's a snippet of what i'm getting back "libMasterAudio" : [ { "audioChannelNo" : null, "masterNo" : null, "audioContentDesc" : null, "audioChannel" : null}] There's similarly other objects within this array where i'm getting a bunch of keys with null values, despite the source having values – raja May 14 '20 at 08:36
  • i think i might've uncovered the root cause. the above script on its own executes fine. It's when i try and execute this in conjunction with the script to rename fields from snake case to camel case - for some reason that adversely affects nested objects that have arrays of objects. Basically I've created an ingest pipeline with two processors, the first processor is a call to the script to remove null values, and the next processor is a call to the script that renames fields that have values from snake case to camel case. When the two execute together certain objects are impacted – raja May 14 '20 at 08:43
  • This answer certainly works, but there's also something one can do to test how their pipeline works. You can use the _simulate API in elasticsearch to run test data against your pipeline. See this link: https://www.elastic.co/guide/en/elasticsearch/reference/master/simulate-pipeline-api.html – DivDiff Mar 22 '21 at 11:47
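
For reference, the _simulate API mentioned in the last comment lets you test the pipeline without indexing anything; a call along these lines (using the sample document from the question) should work:

POST _ingest/pipeline/remove_null_fields/_simulate
{
  "docs": [
    {
      "_source": {
        "master_desc": "TESTING PART",
        "date_added": "2019-10-24T09:30:03",
        "master_no": {
          "master_no": 18460110,
          "barcode": "NLSKYTEST1-1",
          "external_key": null,
          "umid": null
        }
      }
    }
  ]
}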

@Jaspreet, so the script almost worked. However, it didn't eliminate empty objects, empty arrays or empty values. Here is a doc I tried to index -

{
    "master_desc": "TESTING PART",
    "date_added": "2019-10-24T09:30:03",
    "master_no": {
      "master_no": 18460110,
      "barcode": "NLSKYTEST1-1",
      "external_key": null,
      "umid": null
    },
    "remote_sync_state": "",
    "lib_title_footage": [],
    "prj_no": {
       "prj_no": null,
       "prj_desc": null,
  }

The above returned -

{
    "master_desc": "TESTING PART",
    "date_added": "2019-10-24T09:30:03",
    "master_no": {
      "master_no": 18460110,
      "barcode": "NLSKYTEST1-1"
    },
    "remote_sync_state": "",
    "lib_title_footage": [ ],
    "prj_no": { }

I tried updating the script so the condition also checks for these patterns, but unfortunately got a compile error.
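
For what it's worth, the empty objects and arrays appear to be covered by the list.size()==val.size() check in the updated script above; the remaining gap is empty string values such as remote_sync_state. A minimal, untested sketch would be to widen the null check inside loopAllFields:

// hypothetical replacement for the "if(val==null)" check in loopAllFields;
// also treats empty strings (e.g. "remote_sync_state": "") like nulls
if (val == null || (val instanceof String && val.isEmpty())) {
  ret.add(entry.getKey());
}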

raja