0

I've never used Jolt Transform before and I'm not sure how to fix what I've done, so I apologize if this is actually a really easy fix.

I have two XML files (dummy versions below, actual contains PII) that I merged together using MergeRecord in NiFi. Because of the way that the output is (one flow file with an array of JSONS) it was suggested that I use JoltTransform to merge them together properly. I was pointed to this question on how to essentially do a streaming join in NiFi (which is what I needed).

While this works for the most part, I am still having one issue. All of the tags on my "base" level (FatherID, FatherName, BirthDate, etc.) are turned into arrays. I need these to not be arrays, because I'd like to use the same combined schema I used in MergeRecord (which does not have those fields as arrays).

Is there something I need to change in the spec, or do I need to do another JoltTransform (which is fine)?

Input XML 1

<?xml version="1.0" encoding="UTF-8"?>
<FoundingFathers>
   <FatherID>1234</FatherID>
   <FatherName>George Washington</FatherName>
   <ResidentialInformation>
      <Name>Mount Vernon</Name>
      <StreetAddress>3200 Mount Vernon Hwy</StreetAddress>
       <City>Mt Vernon</City>
       <State>VA</State>
       <ZipCode>22121</ZipCode>
   </ResidentialInformation>
    <BirthDate>1732-02-22</BirthDate>
</FoundingFathers>

Input XML 2

<?xml version="1.0" encoding="UTF-8"?>
<DOC>
   <DOCID>1234</DOCID>
   <FATHERNAME>George Washington</FATHERNAME>
   <RAW_TXT>George Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.</RAW_TXT>
   <TXT>
      <S>
         <FATHERNAME>George Washington</FATHERNAME>
         <ESTATENAME>Mount Vernon</>
         <ESTATEPLACE>VA</ESTATEPLACE>
      </S>
      <S>
         <OWNER>Washington family</OWNER>
         <YEAROWNED>1674</YEAROWNED>
      </S>
      <S>
         <BUILTIN>1734</BUILTIN>
         <BUILTBY>Washington's father</BUILTBY>
      </S>
   </TXT>
</DOC>

MergeRecord Configs

Record Reader: XMLReader
Record Writer: JsonRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Correlation Attribute Name: FatherID
Attribute Strategy: Keep All Unique Attributes
Minimum Number of Records: 2
Maximum Number of Records: 2
Minimum Bin Size: 0 B
Maximum Bin Size: No value set
Max Bin Age: No value set
Maximum Number of Bins: 10

Schema

{
  "namespace": "ff",
  "name": "founders",
  "type": "record",
  "fields": [
    {"name":"FatherID", "type": ["string", "null"], "default": null},
    {"name":"FatherName", "type": ["string", "null"], "default": null},
    {"name":"ResidentialInformation", "type": ["null", {
      "name": "ResidentialInformation", "type": "array", "items": {
        "name": "ResidentialInformation", "type": "record", "fields": [
          {"name": "Name", "type": ["string","null"], "default":null},
          {"name": "StreetAddress", "type": ["string","null"], "default":null},
          {"name": "City", "type": ["string","null"], "default":null},
          {"name": "State", "type": ["string","null"], "default":null},
          {"name": "ZipCode", "type": ["string","null"], "default":null}
        ]
      }
    }]},
    {"name":"BirthDate", "type": ["string", "null"], "default": null},
    {"name": "DOCID", "type": ["string", "null"], "default": null},
    {"name": "FINAME", "type": ["string", "null"], "default": null},
    {"name": "CUSTNAME", "type": {"type": "array", "items": "string"}},
    {"name": "RAW_TXT", "type": {"type": "array", "items": "string"}},
    {"name": "TXT", "type": {
      "name": "TXT", "type": "record", "namespace": "txt.sar", "fields": [
        {"name": "S", "type": {
          "type": "array", "items": {
            "name": "RecordInArray", "type": "record", "fields": [
              {"name": "FATHERNAME", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "ESTATENAME", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "ESTATEPLACE", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "OWNER", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "YEAROWNED", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "BUILTIN", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "BUILTBY", "type": {"type": "array", "items": ["string","null"]}}
            ]
          }
        }}
      ]
    }}
  ]}

jolt spec (shift operation)

{
    "*": {
      "*": "&"
    }
}

Actual Output

[ {
  "FatherID" : ["1234", null],
  "FatherName" : ["George Washington", null],
  "ResidentialInformation" : [ {
    "Name" : "Mount Vernon",
    "StreetAddress" : "3200 Mount Vernon Hwy",
    "City" : "Mt Vernon",
    "State" : "VA",
    "ZipCode" : "22121"
  } ],
  "BirthDate" : ["1732-02-22", null],
  "DOCID" : "1234",
  "FATHERNAME" : "George Washington",
  "RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
  "TXT" : {
    "S" : [ {
      "FATHERNAME" : [ "George Washington" ],
      "ESTATENAME" : [ "Mount Vernon" ],
      "ESTATEPLACE" : [ "VA" ]
    }, {
      "OWNER" : [ "Washington family" ],
      "YEAROWNED" : [ "1674" ]
    }, {
      "BUILTIN" : [ "1734" ],
      "BUILTBY" : [ "Washington's father" ]
    } ]
  }
} ]

Expected Output

[ {
  "FatherID" : "1234",
  "FatherName" : "George Washington",
  "ResidentialInformation" : [ {
    "Name" : "Mount Vernon",
    "StreetAddress" : "3200 Mount Vernon Hwy",
    "City" : "Mt Vernon",
    "State" : "VA",
    "ZipCode" : "22121"
  } ],
  "BirthDate" : "1732-02-22",
  "DOCID" : "1234",
  "FATHERNAME" : "George Washington",
  "RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
  "TXT" : {
    "S" : [ {
      "FATHERNAME" : [ "George Washington" ],
      "ESTATENAME" : [ "Mount Vernon" ],
      "ESTATEPLACE" : [ "VA" ]
    }, {
      "OWNER" : [ "Washington family" ],
      "YEAROWNED" : [ "1674" ]
    }, {
      "BUILTIN" : [ "1734" ],
      "BUILTBY" : [ "Washington's father" ]
    } ]
  }
} ]
carousallie
  • 776
  • 1
  • 7
  • 25

1 Answers1

1

MergeContent and MergeRecord are usually for merging two or more flowfiles whose schema is the same, such as bundling individual JSON objects into a larger array.

You might be able to use LookupRecord using an XMLFileLookupService, that will get the content of the second XML file and insert it into the record of the first flowfile at the location you choose. The part I'm not sure of is how you'd lookup the DOCID (to match the FatherID), then return each of the fields in the top-level DOC element.

If that doesn't work, you can always try ExecuteScript or InvokeScriptedProcessor (or a ScriptedLookupService) to do the "merge" manually.

EDIT (additional info): If the JOLT spec always puts the non-null value first (or all elements in the array will be null or identical), you might be able to add a spec to your chain that replaces the field value with the first element in its array.

mattyb
  • 11,693
  • 15
  • 20
  • I used the JoltTransform because I realized that the schema was telling NiFi to expect all fields to already exist in both files, which then successfully combined them into one (aside from the one issue of now the fields are arrays). My concern with LookupRecord is, as you said, how to look up the DOCID and FatherID. Our client really wants us to avoid scripts, but that might be my only option if I can't figure out a way with jolt. I really like the idea of keeping MergeRecord because of how it only merged together the files with the matching id attribute. – carousallie Apr 23 '19 at 16:42
  • I updated my answer to try something with an additional spec. – mattyb Apr 23 '19 at 17:02
  • Thank you. It does always put the non-null value first; do you know exactly what that spec might look like? I've tried looking around, but I'm having trouble understanding the language of Jolt transformations. – carousallie Apr 23 '19 at 19:22
  • I make heavy use of the online Jolt app and its examples, here is a link to the List examples (where you'd use `=firstElement()`): http://jolt-demo.appspot.com/#modify-listFunctions – mattyb Apr 23 '19 at 19:38
  • So just to clarify, I would need to hard code all of the fields that I need to change from arrays to single field values? Thank you so much for all of your help. – carousallie Apr 23 '19 at 19:43
  • You might be able to use `*` in your spec to iterate over all fields, you'd either have to hard-code the ones you *don't* want to change, then apply the `firstElement()` function to everything else, or vice versa – mattyb Apr 24 '19 at 15:14