Given a DataFrame:
+---+-----+---------+-------+------------+
| id|score|tx_amount|isValid|    greeting|
+---+-----+---------+-------+------------+
|  1|  0.2|    23.78|   true| hello_world|
|  2|  0.6|    12.41|  false|byebye_world|
+---+-----+---------+-------+------------+
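For reference, the sample DataFrame can be built roughly like this (a minimal sketch, assuming an existing SparkSession named spark; the DDL schema string is there to match the dtypes shown below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.2, 23.78, True, "hello_world"),
     (2, 0.6, 12.41, False, "byebye_world")],
    "id int, score double, tx_amount double, isValid boolean, greeting string",
)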
I want to explode these columns into rows, with each value in a single column named "col_value", while keeping track of which input column (and type) each value came from. The input types are:
df.dtypes
[('id', 'int'), ('score', 'double'), ('tx_amount', 'double'), ('isValid', 'boolean'), ('greeting', 'string')]
Expected output:
+---+------------+--------+---------+----------+-------+---------+
| id|   col_value|is_score|is_amount|is_boolean|is_text| col_name|
+---+------------+--------+---------+----------+-------+---------+
|  1|         0.2|       Y|        N|         N|      N|    score|
|  1|       23.78|       N|        Y|         N|      N|tx_amount|
|  1|        true|       N|        N|         Y|      N|  isValid|
|  1| hello_world|       N|        N|         N|      Y| greeting|
|  2|         0.6|       Y|        N|         N|      N|    score|
|  2|       12.41|       N|        Y|         N|      N|tx_amount|
|  2|       false|       N|        N|         Y|      N|  isValid|
|  2|byebye_world|       N|        N|         N|      Y| greeting|
+---+------------+--------+---------+----------+-------+---------+
What I have so far:
df.withColumn("cols", F.explode(F.arrays_zip(F.array("score", "tx_amount", "isValid", "greeting")))) \
.select("id", F.col("cols.*")) \
...
But it fails with a type error as soon as I try to put the columns into an array to zip and explode:
pyspark.sql.utils.AnalysisException: "cannot resolve 'array(`score`, `tx_amount`, `isValid`, `greeting`)' due to data type mismatch: input to function array should all be the same type, but it's [double, double, boolean, string]"
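The only direction I can think of is to cast every value to a common type (string) before building the array, so that all elements share one struct type. A rough, untested sketch of what I mean:

from pyspark.sql import functions as F

value_cols = ["score", "tx_amount", "isValid", "greeting"]

# One struct per input column: the column's name plus its value cast to
# string, so every element of the array has the same struct type.
structs = [
    F.struct(
        F.lit(c).alias("col_name"),
        F.col(c).cast("string").alias("col_value"),
    )
    for c in value_cols
]

df.withColumn("col", F.explode(F.array(*structs))) \
  .select("id", "col.col_value", "col.col_name")

I assume the is_score / is_amount / is_boolean / is_text flags could then be added with F.when(...) conditions on col_name, but I'm not sure this cast-everything-to-string route is idiomatic.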
How can I do this when the types of the input columns can all be different?