
I am trying to load XML files using Databricks spark-xml. The data loads correctly, but I need to change the name of one of the columns and add it as a separate tag inside the schema. Essentially, a few tags that are defined in the XSD but do not appear in the data need to be generated as null.

Example:

root
  First Tag
     Element Name
     Second Tag (tag to change)
        Tag3
        Tag4

I need to change it to:

root
  First Tag
     Element Name
     Second Tag 
        Tag3
        Tag4
     Third Tag 
        Tag3
        Tag4

I have tried several approaches (I cannot add the schema manually):

  1. withColumn: this lets me add a new column, but only at the root level. I need to add it at a specific level of the hierarchy.
  2. withColumnRenamed: this does not change anything.

Any help is appreciated!

Deepan Ram

1 Answer


There is no shortcut for this, because the DataFrame API does not let you change parts of the schema that are nested more than one level down.

So consider breaking the complex tags into flat, single-level tags, and include a primary key so that the records can be identified and joined back.

Once you have the simple tags, you can use withColumnRenamed or other options to change the names or data types, and then join back on the primary key to recreate the original DataFrame (but with the modified names or types).
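
To make the flatten, rename, and join-back idea concrete, here is a minimal Spark-free sketch on plain Python dictionaries. Everything in it is an assumption made for illustration: the sample rows, the `id` primary key, the tag names written without spaces, the helper names `flatten` and `nest`, and the choice to fill the new tag with null rather than copy the old values. In Spark the same steps would be expressed with DataFrame operations instead of dictionaries.

```python
# Hypothetical rows standing in for the parsed XML; "id" plays the role of
# the primary key used to join the pieces back together.
rows = [
    {"id": 1, "FirstTag": {"ElementName": "a", "SecondTag": {"Tag3": "x", "Tag4": "y"}}},
    {"id": 2, "FirstTag": {"ElementName": "b", "SecondTag": {"Tag3": "p", "Tag4": "q"}}},
]


def flatten(record, prefix=""):
    """Turn nested dicts into a single level of dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def nest(flat):
    """Rebuild nested dicts from dotted column names."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


# Step 1: break the complex tags into one-level columns, keeping the key.
flat_rows = [flatten(r) for r in rows]

# Step 2: build a "simple" table for the new tag, keyed by id. The columns
# are the SecondTag columns renamed to ThirdTag, with null (None) values.
old, new = "FirstTag.SecondTag.", "FirstTag.ThirdTag."
third_tag = {
    r["id"]: {name.replace(old, new, 1): None for name in r if name.startswith(old)}
    for r in flat_rows
}

# Step 3: join back on the primary key and rebuild the hierarchy.
result = [nest({**r, **third_tag[r["id"]]}) for r in flat_rows]
```

After this runs, each entry of `result` has `ThirdTag` as a sibling of `SecondTag` under `FirstTag`, with the same child names. To copy the existing values instead of generating nulls, replace `None` in step 2 with `r[name]`.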

Puja Basu