I have two XML files (dummy versions below, actual contains PII) that I am attempting to merge together into one flow file using MergeRecord. My flow is that I get the files, I use ExtractText to extract the FatherID/DOCID and add it as an attribute, then use that in the MergeRecord config. My MergeRecord config includes an XML reader, a JSON writer, and an Avro schema which I’ve pasted below.
However, my output is one flow file that contains an array of JSONs rather just one JSON. This is a problem, as I will be putting this into Elasticsearch and if I put the one flow file into Elasticsearch, it treats it as two separate entries. I’ve heard some recommendations about using JoltTransform but I wouldn’t even know where to begin.
Input XML 1
<?xml version="1.0" encoding="UTF-8"?>
<FoundingFathers>
<FatherID>1234</FatherID>
<FatherName>George Washington</FatherName>
<ResidentialInformation>
<Name>Mount Vernon</Name>
<StreetAddress>3200 Mount Vernon Hwy</StreetAddress>
<City>Mt Vernon</City>
<State>VA</State>
<ZipCode>22121</ZipCode>
</ResidentialInformation>
<BirthDate>1732-02-22</BirthDate>
</FoundingFathers>
Input XML 2
<?xml version="1.0" encoding="UTF-8"?>
<DOC>
<DOCID>1234</DOCID>
<FATHERNAME>George Washington</FATHERNAME>
<RAW_TXT>George Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.</RAW_TXT>
<TXT>
<S>
<FATHERNAME>George Washington</FATHERNAME>
<ESTATENAME>Mount Vernon</>
<ESTATEPLACE>VA</ESTATEPLACE>
</S>
<S>
<OWNER>Washington family</OWNER>
<YEAROWNED>1674</YEAROWNED>
</S>
<S>
<BUILTIN>1734</BUILTIN>
<BUILTBY>Washington's father</BUILTBY>
</S>
</TXT>
</DOC>
MergeRecord Configs
Record Reader: XMLReader
Record Writer: JsonRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Correlation Attribute Name: FatherID
Attribute Strategy: Keep All Unique Attributes
Minimum Number of Records: 2
Maximum Number of Records: 2
Minimum Bin Size: 0 B
Maximum Bin Size: No value set
Max Bin Age: No value set
Maximum Number of Bins: 10
Schema
{
"namespace": "ff",
"name": "founders",
"type": "record",
"fields": [
{"name":"FatherID", "type": ["string", "null"], "default": null},
{"name":"FatherName", "type": ["string", "null"], "default": null},
{"name":"ResidentialInformation", "type": ["null", {
"name": "ResidentialInformation", "type": "array", "items": {
"name": "ResidentialInformation", "type": "record", "fields": [
{"name": "Name", "type": ["string","null"], "default":null},
{"name": "StreetAddress", "type": ["string","null"], "default":null},
{"name": "City", "type": ["string","null"], "default":null},
{"name": "State", "type": ["string","null"], "default":null},
{"name": "ZipCode", "type": ["string","null"], "default":null}
]
}
}]},
{"name":"BirthDate", "type": ["string", "null"], "default": null},
{"name": "DOCID", "type": ["string", "null"], "default": null},
{"name": "FINAME", "type": ["string", "null"], "default": null},
{"name": "CUSTNAME", "type": {"type": "array", "items": "string"}},
{"name": "RAW_TXT", "type": {"type": "array", "items": "string"}},
{"name": "TXT", "type": {
"name": "TXT", "type": "record", "namespace": "txt.sar", "fields": [
{"name": "S", "type": {
"type": "array", "items": {
"name": "RecordInArray", "type": "record", "fields": [
{"name": "FATHERNAME", "type": {"type": "array", "items": ["string","null"]}},
{"name": "ESTATENAME", "type": {"type": "array", "items": ["string","null"]}},
{"name": "ESTATEPLACE", "type": {"type": "array", "items": ["string","null"]}},
{"name": "OWNER", "type": {"type": "array", "items": ["string","null"]}},
{"name": "YEAROWNED", "type": {"type": "array", "items": ["string","null"]}},
{"name": "BUILTIN", "type": {"type": "array", "items": ["string","null"]}},
{"name": "BUILTBY", "type": {"type": "array", "items": ["string","null"]}}
]
}
}}
]
}}
]}
Actual Output
[ {
"FatherID" : "1234",
"FatherName" : "George Washington",
"ResidentialInformation" : [ {
"Name" : "Mount Vernon",
"StreetAddress" : "3200 Mount Vernon Hwy",
"City" : "Mt Vernon",
"State" : "VA",
"ZipCode" : "22121"
} ],
"BirthDate" : "1732-02-22",
"DOCID" : null,
"FATHERNAME" : null,
"RAW_TXT" : [ ],
"TXT" : null
}, {
"FatherID" : null,
"FatherName" : null,
"BirthDate" : null,
"DOCID" : "1234",
"FATHERNAME" : "George Washington",
"RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
"TXT" : {
"S" : [ {
"FATHERNAME" : [ "George Washington" ],
"ESTATENAME" : [ "Mount Vernon" ],
"ESTATEPLACE" : [ "VA" ]
}, {
"OWNER" : [ "Washington family" ],
"YEAROWNED" : [ "1674" ]
}, {
"BUILTIN" : [ "1734" ],
"BUILTBY" : [ "Washington's father" ]
} ]
}
} ]
Expected Output
[ {
"FatherID" : "1234",
"FatherName" : "George Washington",
"ResidentialInformation" : [ {
"Name" : "Mount Vernon",
"StreetAddress" : "3200 Mount Vernon Hwy",
"City" : "Mt Vernon",
"State" : "VA",
"ZipCode" : "22121"
} ],
"BirthDate" : "1732-02-22",
"DOCID" : "1234",
"FATHERNAME" : "George Washington",
"RAW_TXT" : [ "\nGeorge Washington lived in Mount Vernon in Mt Vernon, VA. The Washington family had owned land in the area since 1674. The original house was built in 1734 by Washington's father.\n" ],
"TXT" : {
"S" : [ {
"FATHERNAME" : [ "George Washington" ],
"ESTATENAME" : [ "Mount Vernon" ],
"ESTATEPLACE" : [ "VA" ]
}, {
"OWNER" : [ "Washington family" ],
"YEAROWNED" : [ "1674" ]
}, {
"BUILTIN" : [ "1734" ],
"BUILTBY" : [ "Washington's father" ]
} ]
}
} ]