I'm attempting to process bz2-compressed XML files with a nested schema into normalized tables: each level of the schema is stored as a row, and any child elements are stored as rows in a separate table with a foreign key relating back to the keyed parent row.
These files can be rather large; a single 181MB compressed file explodes into a very large number of rows. If I don't supply a schema to the DataFrameReader, it crashes with out-of-memory errors just trying to parse the file and infer the schema. Providing the schema explicitly gets past that, but...
I currently have a DataFrame at the parent level of the XML, selected as a nested schema out of the entire file, but when I try to write that parent DataFrame to Parquet, the single node doing the file load throws out-of-memory errors.
How can I traverse the XML and write the data into normalized parquets for each level of the schema?
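To make the target shape concrete, here is roughly what I mean by normalized tables, sketched for a single child array (includedBilltypeCategory). The surrogate key `plan_key`, the cache call, and the output paths are illustrative placeholders, not my actual code:

import org.apache.spark.sql.functions._

// Parent table: one row per plan, with a generated surrogate key.
// Cached so the ids are not recomputed (and changed) between the two writes.
val plans = df
  .select(col("includedCalendarYearCategory.includedPlanCategory.*"))
  .withColumn("plan_key", monotonically_increasing_id())
  .cache()

// Child table: one row per includedBilltypeCategory element,
// keyed back to its parent plan row via plan_key.
val billTypes = plans
  .select(col("plan_key"), explode(col("includedBilltypeCategory")).as("bt"))
  .select(col("plan_key"), col("bt.*"))

plans
  .drop("includedBilltypeCategory") // and likewise the other child arrays
  .write.parquet("/out/plans")
billTypes.write.parquet("/out/plan_bill_types")

The same explode-and-key pattern would repeat for each of the other child arrays and at each level of nesting.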
The schema is:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("_xmlns", StringType, true),
  StructField("includedCalendarYearCategory", StructType(Seq(
    StructField("calendarYear", LongType, true),
    StructField("includedPlanCategory", StructType(Seq(
      StructField("includedBilltypeCategory", ArrayType(StructType(Seq(
        StructField("billTypeCode", LongType, true),
        StructField("claimsExcluded", LongType, true),
        StructField("claimsIncluded", LongType, true)
      )), true), true),
      StructField("includedMedicalClaimCategory", ArrayType(StructType(Seq(
        StructField("enrolleeIdentifier", StringType, true),
        StructField("includedSupplementalRecordCategory", StructType(Seq(
          StructField("addDeleteVoidCode", StringType, true),
          StructField("enrolleeIdentifier", StringType, true),
          StructField("originalMedicalClaimId", StringType, true),
          StructField("reasonCode", StringType, true),
          StructField("supplementalDetailRecordId", StringType, true)
        )), true),
        StructField("medicalClaimIdentifier", StringType, true),
        StructField("raEligibleIndicator", LongType, true),
        StructField("rxcEligibleIndicator", LongType, true),
        StructField("serviceCode", LongType, true)
      )), true), true),
      StructField("includedPharmacyClaimCategory", ArrayType(StructType(Seq(
        StructField("enrolleeIdentifier", StringType, true),
        StructField("nationalDrugCode", LongType, true),
        StructField("pharmacyClaimIdentifier", StringType, true),
        StructField("policyPaidAmount", DoubleType, true),
        StructField("raEligibleIndicator", LongType, true),
        StructField("reasonCode", StringType, true)
      )), true), true),
      StructField("includedReasonCodeCategory", ArrayType(StructType(Seq(
        StructField("medicalClaimsExcluded", LongType, true),
        StructField("medicalReasonCode", StringType, true),
        StructField("pharmacyClaimsExcluded", LongType, true),
        StructField("pharmacyReasonCode", StringType, true)
      )), true), true),
      StructField("includedServiceCodeCategory", ArrayType(StructType(Seq(
        StructField("claimsExcluded", LongType, true),
        StructField("claimsIncluded", LongType, true),
        StructField("serviceCode", StringType, true)
      )), true), true),
      StructField("includedUnlinkedSupplementalCategory", ArrayType(StructType(Seq(
        StructField("addDeleteVoidCode", LongType, true),
        StructField("enrolleeIdentifier", StringType, true),
        StructField("originalMedicalClaimId", StringType, true),
        StructField("supplementalDiagnosisDetailRecordId", LongType, true)
      )), true), true),
      StructField("medicalClaimsExcluded", LongType, true),
      StructField("medicalClaimsIncluded", LongType, true),
      StructField("pharmacyClaimsExcluded", LongType, true),
      StructField("pharmacyClaimsIncluded", LongType, true),
      StructField("planIdentifier", StringType, true),
      StructField("supplementalRecordsExcluded", LongType, true),
      StructField("supplementalRecordsIncluded", LongType, true),
      StructField("totalEnrollees", LongType, true),
      StructField("totalEnrolleesWRaEligibleclaims", LongType, true),
      StructField("totalUniqueNDC", LongType, true)
    )), true),
    StructField("medicalClaimsExcluded", LongType, true),
    StructField("medicalClaimsIncluded", LongType, true),
    StructField("pharmacyClaimsExcluded", LongType, true),
    StructField("pharmacyClaimsIncluded", LongType, true),
    StructField("supplementalRecordsExcluded", LongType, true),
    StructField("supplementalRecordsIncluded", LongType, true),
    StructField("totalEnrollees", LongType, true),
    StructField("totalEnrolleesWRaEligibleclaims", LongType, true),
    StructField("totalUniqueNDC", LongType, true)
  )), true),
  StructField("includedFileHeader", StructType(Seq(
    StructField("cmsBatchIdentifier", StringType, true),
    StructField("cmsJobIdentifier", LongType, true),
    StructField("edgeServerIdentifier", LongType, true),
    StructField("edgeServerProcessIdentifier", LongType, true),
    StructField("edgeServerVersion", StringType, true),
    StructField("globalReferenceDataVersion", StringType, true),
    StructField("interfaceControlReleaseNumber", StringType, true),
    StructField("issuerIdentifier", LongType, true),
    StructField("outboundFileGenerationDateTime", TimestampType, true),
    StructField("outboundFileIdentifier", StringType, true),
    StructField("outboundFileTypeCode", StringType, true),
    StructField("snapShotFileHash", StringType, true),
    StructField("snapShotFileName", StringType, true)
  )), true)
))
We are reading the DataFrame like this:
sparkSession.read
  .schema(schema) // schema from the block above
  .format("xml")
  .option("rootTag", "riskAdjustmentClaimSelectionDetailReport")
  .option("rowTag", "riskAdjustmentClaimSelectionDetailReport")
  .xml(path)
  .repartition(200)
I thought repartitioning would help spread the single file across multiple nodes, but it makes sense that the XML has to be fully parsed before Spark can figure out how to chunk it. Is there something I can configure in Spark to load a massive XML file across the cluster, instead of entirely on a single node, using the embedded spark-xml library?
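For reference, the write that blows up is nothing more elaborate than the following (parentDf is the DataFrame from the read above and the output path is a placeholder):

parentDf.write.mode("overwrite").parquet("/out/parent")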