Suppose I have following xml data:
<students>
<studentId>110</studentId>
<info>
<rollNo>2</rollNo>
<address>
<permanent>abc</permanent>
<temporary>def</temporary>
</address>
</info>
<subjects>
<subject>
<name>maths</name>
<credit>3</credit>
</subject>
<subject>
<name>science</name>
<credit>2</credit>
</subject>
</subjects>
</students>
And its schema is:
root
|-- info: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- permanent: string (nullable = true)
| | |-- temporary: string (nullable = true)
| |-- rollNo: long (nullable = true)
|-- studentId: long (nullable = true)
|-- subjects: struct (nullable = true)
| |-- subject: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- credit: long (nullable = true)
| | | |-- name: string (nullable = true)
as root tag being "students"
.
Here, I wanted to update the value of certain columns.
I wanted to update the value of "studentId"
column using UDF
. And I found a way:
df = df.withColumn("studentId", updateValue(col("studentId")))
Then, I wanted to update a nested column i.e. "info.rollNo"
.
Applying above process gave me another new column as "<info.rollNo>updated_value</info.rollNo>
". After searching for a while, I found a way:
val colStruct = df.select(col("info" + ".*")).columns
.filter(_ != "rollNo")
.map(f => col("info" + "." + f))
df = df.withColumn("info",
struct(
(colStruct :+ updateValue(col("info.rollNo")
).as("rollNo")): _*)
)
For the third nested columns, I tried the above mentioned way. But I could not figure out the process.
Here, the question is, could someone explain me the algorithm of updating the nested columns value whose nest level could be 3,4,5 and so on.
For Instance: I want to update following fields.
"info.address.permanent"
which is struct
and
"subjects.subject.credit"
which is the element of array "subject"
PS: If you know any other ways to update certain columns, then please mention it.