I have a schema like this:
root
|-- DataColumn1: struct (nullable = true)
| |-- colA: double (nullable = true)
| |-- colB: struct (nullable = true)
| | |-- fieldA: double (nullable = true)
| | |-- fieldB: double (nullable = true)
| | |-- fieldC: double (nullable = true)
| |-- colC: long (nullable = true)
| |-- colD: string (nullable = true)
|-- DataColumn2: string (nullable = true)
|-- DataColumn3: string (nullable = true)
My goal is to create a new column, say 'DataColumn4', which is the sum of the fields 'fieldA', 'fieldB', and 'fieldC' (fieldA + fieldB + fieldC) inside the struct 'colB', which is itself nested inside 'DataColumn1'.
There could be N fields inside 'colB', so how do I sum them all without referencing each field individually as DataColumn1.colB.fieldA, DataColumn1.colB.fieldB, and so on?
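For reference, this is the hand-written version I am trying to avoid (df is my DataFrame); it only works because I know the three field names in advance:

    from pyspark.sql import functions as F

    # Every nested field is referenced explicitly, which breaks
    # as soon as colB gains or loses a field.
    df = df.withColumn(
        "DataColumn4",
        F.col("DataColumn1.colB.fieldA")
        + F.col("DataColumn1.colB.fieldB")
        + F.col("DataColumn1.colB.fieldC"),
    )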
Example data:
DataColumn1             DataColumn2   DataColumn3
(1, (1, 2, 3), 4, 5)    XXX           YYY
(1, (2, 3, 3), 8, 9)    XYZ           XYX
My expected result should have a new column that is the sum of the nested fields:
DataColumn1             DataColumn2   DataColumn3   DataColumn4
(1, (1, 2, 3), 4, 5)    XXX           YYY           6 (since 1 + 2 + 3 = 6)
(1, (2, 3, 3), 8, 9)    XYZ           XYX           8 (since 2 + 3 + 3 = 8)
How do I write code for this in PySpark, preferably without a Pandas UDF?
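What I am picturing is something along these lines: read the field names of colB from the DataFrame's schema and build the sum dynamically. This is an untested sketch (df is my DataFrame, and the variable names are my own), not a confirmed solution:

    from functools import reduce
    from pyspark.sql import functions as F

    # List the names of the fields nested inside DataColumn1.colB from
    # the schema, so the sum adapts to however many fields colB has.
    inner_fields = df.schema["DataColumn1"].dataType["colB"].dataType.fieldNames()

    # Fold the field references into a single fieldA + fieldB + ... expression.
    total = reduce(
        lambda acc, name: acc + F.col(f"DataColumn1.colB.{name}"),
        inner_fields[1:],
        F.col(f"DataColumn1.colB.{inner_fields[0]}"),
    )

    df = df.withColumn("DataColumn4", total)

Is this schema-introspection approach reasonable, or is there a more idiomatic way to do it?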