1

I have two XML files (dummy versions below, actual contains PII) that I am attempting to merge together into one flow file using MergeRecord. My flow is that I get the files, I use ExtractText to extract the FatherID/DOCID and add it as an attribute, then use that in the MergeRecord config. My MergeRecord config includes an XML reader, a JSON writer, and an Avro schema which I’ve pasted below.

However, my output is one flow file that contains an array of JSONs rather just one JSON. This is a problem, as I will be putting this into Elasticsearch and if I put the one flow file into Elasticsearch, it treats it as two separate entries. I’ve heard some recommendations about using JoltTransform but I wouldn’t even know where to begin.

Input XML 1

<?xml version="1.0" encoding="UTF-8"?>
<FoundingFathers>
   <FatherID>1234</FatherID>
   <FatherName>George Washington</FatherName>
   <ResidentialInformation>
      <Name>Mount Vernon</Name>
      <StreetAddress>3200 Mount Vernon Hwy</StreetAddress>
       <City>Mt Vernon</City>
       <State>VA</State>
       <ZipCode>22121</ZipCode>
   </ResidentialInformation>
    <BirthDate>1732-02-22</BirthDate>
</FoundingFathers>

Input XML 2

<?xml version="1.0" encoding="UTF-8"?>
<DOC>
   <DOCID>1234</DOCID>
   <FATHERNAME>George Washington</FATHERNAME>
   <RAW_TXT>George Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.</RAW_TXT>
   <TXT>
      <S>
         <FATHERNAME>George Washington</FATHERNAME>
         <ESTATENAME>Mount Vernon</>
         <ESTATEPLACE>VA</ESTATEPLACE>
      </S>
      <S>
         <OWNER>Washington family</OWNER>
         <YEAROWNED>1674</YEAROWNED>
      </S>
      <S>
         <BUILTIN>1734</BUILTIN>
         <BUILTBY>Washington's father</BUILTBY>
      </S>
   </TXT>
</DOC>

MergeRecord Configs

Record Reader: XMLReader
Record Writer: JsonRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Correlation Attribute Name: FatherID
Attribute Strategy: Keep All Unique Attributes
Minimum Number of Records: 2
Maximum Number of Records: 2
Minimum Bin Size: 0 B
Maximum Bin Size: No value set
Max Bin Age: No value set
Maximum Number of Bins: 10

Schema

{
  "namespace": "ff",
  "name": "founders",
  "type": "record",
  "fields": [
    {"name":"FatherID", "type": ["string", "null"], "default": null},
    {"name":"FatherName", "type": ["string", "null"], "default": null},
    {"name":"ResidentialInformation", "type": ["null", {
      "name": "ResidentialInformation", "type": "array", "items": {
        "name": "ResidentialInformation", "type": "record", "fields": [
          {"name": "Name", "type": ["string","null"], "default":null},
          {"name": "StreetAddress", "type": ["string","null"], "default":null},
          {"name": "City", "type": ["string","null"], "default":null},
          {"name": "State", "type": ["string","null"], "default":null},
          {"name": "ZipCode", "type": ["string","null"], "default":null}
        ]
      }
    }]},
    {"name":"BirthDate", "type": ["string", "null"], "default": null},
    {"name": "DOCID", "type": ["string", "null"], "default": null},
    {"name": "FINAME", "type": ["string", "null"], "default": null},
    {"name": "CUSTNAME", "type": {"type": "array", "items": "string"}},
    {"name": "RAW_TXT", "type": {"type": "array", "items": "string"}},
    {"name": "TXT", "type": {
      "name": "TXT", "type": "record", "namespace": "txt.sar", "fields": [
        {"name": "S", "type": {
          "type": "array", "items": {
            "name": "RecordInArray", "type": "record", "fields": [
              {"name": "FATHERNAME", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "ESTATENAME", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "ESTATEPLACE", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "OWNER", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "YEAROWNED", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "BUILTIN", "type": {"type": "array", "items": ["string","null"]}},
              {"name": "BUILTBY", "type": {"type": "array", "items": ["string","null"]}}
            ]
          }
        }}
      ]
    }}
  ]}

Actual Output

[ {
  "FatherID" : "1234",
  "FatherName" : "George Washington",
  "ResidentialInformation" : [ {
    "Name" : "Mount Vernon",
    "StreetAddress" : "3200 Mount Vernon Hwy",
    "City" : "Mt Vernon",
    "State" : "VA",
    "ZipCode" : "22121"
  } ],
  "BirthDate" : "1732-02-22",
  "DOCID" : null,
  "FATHERNAME" : null,
  "RAW_TXT" : [ ],
  "TXT" : null
}, {
  "FatherID" : null,
  "FatherName" : null,
  "BirthDate" : null,
  "DOCID" : "1234",
  "FATHERNAME" : "George Washington",
  "RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
  "TXT" : {
    "S" : [ {
      "FATHERNAME" : [ "George Washington" ],
      "ESTATENAME" : [ "Mount Vernon" ],
      "ESTATEPLACE" : [ "VA" ]
    }, {
      "OWNER" : [ "Washington family" ],
      "YEAROWNED" : [ "1674" ]
    }, {
      "BUILTIN" : [ "1734" ],
      "BUILTBY" : [ "Washington's father" ]
    } ]
  }
} ]

Expected Output

[ {
  "FatherID" : "1234",
  "FatherName" : "George Washington",
  "ResidentialInformation" : [ {
    "Name" : "Mount Vernon",
    "StreetAddress" : "3200 Mount Vernon Hwy",
    "City" : "Mt Vernon",
    "State" : "VA",
    "ZipCode" : "22121"
  } ],
  "BirthDate" : "1732-02-22",
  "DOCID" : "1234",
  "FATHERNAME" : "George Washington",
  "RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
  "TXT" : {
    "S" : [ {
      "FATHERNAME" : [ "George Washington" ],
      "ESTATENAME" : [ "Mount Vernon" ],
      "ESTATEPLACE" : [ "VA" ]
    }, {
      "OWNER" : [ "Washington family" ],
      "YEAROWNED" : [ "1674" ]
    }, {
      "BUILTIN" : [ "1734" ],
      "BUILTBY" : [ "Washington's father" ]
    } ]
  }
} ]
carousallie
  • 776
  • 1
  • 7
  • 25

1 Answers1

1

The desired output is more of a join, rather than a merge. A merge is just putting records, or content, one after another, which in this case creates a json array, where each entry in the array is the json of each flow file that is part of the merge.

Processors don't really do true streaming joins, but an approach that can be taken for joining two JSON flow files is described in this answer:

https://stackoverflow.com/a/55124212/5650316

Bryan Bende
  • 18,320
  • 1
  • 28
  • 39
  • It worked perfectly, however, it's made FatherID into an array ("FatherID" : ["1234", null])--any idea what may have caused it? – carousallie Apr 22 '19 at 19:12
  • I don't really know too much about JOLT, but I think it is not really expecting you to have fields with the same name but different values. I think if FatherId was 1234 in both then it might not create an array, or if only one document had FatherId as 1234 and the null one didn't exist, then it might not produce an array. – Bryan Bende Apr 22 '19 at 19:35
  • Sorry, I didn't write a clear comment. It made every field (except for nested values under residentialinformation) into an array. i.e. "FatherID" : ["1234", null], "FatherName" : ["George Washington", null] and so on and so forth – carousallie Apr 22 '19 at 19:40
  • Ok it may be that it is not meant to work when there are overlapping field names.. JOLT is trying to combine the field names from each record into one, and when it sees a field twice then it doesn't know what to do so it represents the values in array. – Bryan Bende Apr 22 '19 at 19:51
  • Someone with more JOLT experience may know if there is a better way. You can always fall back to a custom processor or scripted processor that can be more specific to your data and do whatever you need to do. – Bryan Bende Apr 22 '19 at 19:52