
Given a DataFrame:

+---+-----------+---------+-------+------------+
| id|model_score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+------------+
|  1|        0.2|    23.78|   true| hello_world|
|  2|        0.6|    12.41|  false|byebye_world|
+---+-----------+---------+-------+------------+

I want to explode these columns into rows, with each value in a column named "col_value", using the types from the input DataFrame.

df.dtypes

[('id', 'int'), ('model_score', 'double'), ('tx_amount', 'double'), ('isValid', 'boolean'), ('greeting', 'string')]

Expected output:

+---+------------+--------+---------+----------+-------+-----------+
| id|   col_value|is_score|is_amount|is_boolean|is_text|   col_name|
+---+------------+--------+---------+----------+-------+-----------+
|  1|         0.2|       Y|        N|         N|      N|model_score|
|  1|       23.78|       N|        Y|         N|      N|  tx_amount|
|  1|        true|       N|        N|         Y|      N|    isValid|
|  1| hello_world|       N|        N|         N|      Y|   greeting|
|  2|         0.6|       Y|        N|         N|      N|model_score|
|  2|       12.41|       N|        Y|         N|      N|  tx_amount|
|  2|       false|       N|        N|         Y|      N|    isValid|
|  2|byebye_world|       N|        N|         N|      Y|   greeting|
+---+------------+--------+---------+----------+-------+-----------+

What I have so far:

df.withColumn("cols", F.explode(F.arrays_zip(F.array("score", "tx_amount", "isValid", "greeting")))) \
  .select("id", F.col("cols.*")) \
  ...

But it fails with a type error when I try to zip the columns for the explode:

pyspark.sql.utils.AnalysisException: "cannot resolve 'array(`id`, `model_score`, `tx_amount`, `isValid`, `greeting`)' due to data type mismatch: input to function array should all be the same type, but it's [int, double, double, boolean, string]

How can I do this when the types of the input columns can all be different?

bp2010

2 Answers


Sample DataFrame:

df.show()
df.printSchema()

+---+-----------+---------+-------+------------+
| id|model_score|tx_amount|isValid|    greeting|
+---+-----------+---------+-------+------------+
|  1|        0.2|    23.78|   true| hello_world|
|  2|        0.6|    12.41|  false|byebye_world|
+---+-----------+---------+-------+------------+

root
 |-- id: integer (nullable = true)
 |-- model_score: double (nullable = true)
 |-- tx_amount: double (nullable = true)
 |-- isValid: boolean (nullable = true)
 |-- greeting: string (nullable = true)

I tried to keep this dynamic for any set of input columns. It takes each column's type and name from df.dtypes[1:]; the first entry is skipped because id should not appear in col_value. Since array only accepts elements of the same type, all columns are cast to string before the logic is applied. I think it should work for your use case. You can build your Y/N columns from here (a sketch follows the output below).

from pyspark.sql import functions as F

# array() accepts only one element type, so cast every column to string,
# then build a [value, type, name] triple per column (skipping id) and explode
df.select([F.col(c).cast("string") for c in df.columns]) \
  .withColumn("cols", F.explode(F.arrays_zip(
      F.array([F.array(F.col(x[0]), F.lit(x[1]), F.lit(x[0])) for x in df.dtypes[1:]])))) \
  .select("id", F.col("cols.*")) \
  .withColumn("col_value", F.element_at("0", 1)) \
  .withColumn("col_type", F.element_at("0", 2)) \
  .withColumn("col_name", F.element_at("0", 3)) \
  .drop("0").show()

+---+------------+--------+-----------+
| id|   col_value|col_type|   col_name|
+---+------------+--------+-----------+
|  1|         0.2|  double|model_score|
|  1|       23.78|  double|  tx_amount|
|  1|        true| boolean|    isValid|
|  1| hello_world|  string|   greeting|
|  2|         0.6|  double|model_score|
|  2|       12.41|  double|  tx_amount|
|  2|       false| boolean|    isValid|
|  2|byebye_world|  string|   greeting|
+---+------------+--------+-----------+
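
To build the Y/N columns from here, one option is F.when on col_type and col_name. A sketch, not part of the original answer; it assumes the exploded result above was assigned to a variable exploded instead of ending in .show():

exploded \
    .withColumn("is_score", F.when(F.col("col_name") == "model_score", "Y").otherwise("N")) \
    .withColumn("is_amount", F.when(F.col("col_name") == "tx_amount", "Y").otherwise("N")) \
    .withColumn("is_boolean", F.when(F.col("col_type") == "boolean", "Y").otherwise("N")) \
    .withColumn("is_text", F.when(F.col("col_type") == "string", "Y").otherwise("N")) \
    .drop("col_type").show()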
murtihash
  • This only works when the df is of type string. In this case the df is coming from reading a csv, and the output to df.dtypes is: `[('id', 'int'), ('model_score', 'double'), ('tx_amount', 'double'), ('isValid', 'boolean'), ('greeting', 'string')]` so the above will not work when the df already has its types – bp2010 Apr 08 '20 at 10:51
  • 1
    thanks for the answer. Input cols could be different at each run, which is why I am trying to use the input types to dynamically give it the value – bp2010 Apr 08 '20 at 11:02
  • I have **`updated`** answer, and used **`df.dtypes`** to keep it dynamic for any input cols. try it – murtihash Apr 09 '20 at 04:40
  • 1
    thanks again, this is the route I was trying to take. – bp2010 Apr 09 '20 at 14:34
  • any alternative to F.arrays_zip in spark 2.3 ? – bp2010 Apr 29 '20 at 12:40
  • For this case, we could use create_map, but it can't be dynamic; you would have to hard-code the columns in it. Will update if I figure it out. – murtihash Apr 29 '20 at 14:56
  • I'll look into create_map.. I've added a question here: https://stackoverflow.com/questions/61503929/pyspark-arrays-zip-equivalent-in-spark-2-3 – bp2010 Apr 30 '20 at 08:03
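
For the Spark 2.3 question in the comments: a minimal sketch (not from the original thread) that avoids arrays_zip by building an array of structs and exploding it; array, struct, and explode all exist in 2.3:

from pyspark.sql import functions as F

# one struct per data column: (value cast to string, type name, column name)
structs = [F.struct(F.col(c).cast("string").alias("col_value"),
                    F.lit(t).alias("col_type"),
                    F.lit(c).alias("col_name"))
           for c, t in df.dtypes[1:]]

df.select("id", F.explode(F.array(structs)).alias("cols")).select("id", "cols.*").show()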

You can try several unions:

from pyspark.sql import functions as F

df = df.select(
    "id",
    F.col("score").cast("string").alias("col_value"),
    F.lit("Y").alias("is_score"),
    F.lit("N").alias("is_amount"),
    F.lit("N").alias("is_boolean"),
    F.lit("N").alias("is_text"),
).union(df.select(
    "id",
    F.col("tx_amount").cast("string").alias("col_value"),
    F.lit("N").alias("is_score"),
    F.lit("Y").alias("is_amount"),
    F.lit("N").alias("is_boolean"),
    F.lit("N").alias("is_text"),
)).union(...) # etc
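
If the columns differ at each run (as the comments on the first answer note), the same unions could be generated dynamically with functools.reduce; a sketch, assuming a hypothetical flag_for mapping from column name to its flag column:

from functools import reduce
from pyspark.sql import functions as F

flags = ["is_score", "is_amount", "is_boolean", "is_text"]
# hypothetical mapping: which flag gets "Y" for each column; adjust per run
flag_for = {"model_score": "is_score", "tx_amount": "is_amount",
            "isValid": "is_boolean", "greeting": "is_text"}

parts = [df.select("id",
                   F.col(c).cast("string").alias("col_value"),
                   *[F.lit("Y" if flag_for[c] == f else "N").alias(f) for f in flags])
         for c, _ in df.dtypes[1:]]

result = reduce(lambda a, b: a.union(b), parts)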

Steven