I went through the question Why is my Code Repo warning me about using withColumn in a for/while loop?, which says that we need to avoid withColumn and use this instead:

my_other_columns = [...]

df = df.select(
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns]
)
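
To make sure I understand that pattern, here is a toy version with made-up data (columns not in my_other_columns are kept as-is, the rest are renamed with a suffix):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy data: "a" should be kept as-is, "b" should become "b_suffix"
df = spark.createDataFrame([(1, 2.0), (3, 4.0)], ("a", "b"))
my_other_columns = ["b"]

df = df.select(
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns]
)
# df.columns is now ['a', 'b_suffix']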

This avoids any StackOverflowError issues. My question is more of a follow-up: I have complex nested when/otherwise logic that I'd like to use with the approach above. So I have tried:

df = df.select(
    *[col_name for col_name in df.columns if col_name not in column_list],
    *[when(...).otherwise(
          when(...).otherwise(when(...).otherwise(...)))
        .alias(col_name + "_new_name") for col_name in column_list],
    *[df["result"] + abs(df[col_name])
        .alias("result") for col_name in column_list]
)

I also tried the approach from this helpful link:

cols_to_keep = [c for c in df.columns if c not in column_list]
cols_transformed1 = [when(...).otherwise(
        when(...).otherwise(when(...).otherwise(...))).alias(c + "_new_name")
    for c in cols_to_compare]
cols_transformed2 = [df["result"] + abs(df[col_name]).alias("result") for col_name in cols_to_compare]
df.select(*cols_to_keep, *cols_transformed1, *cols_transformed2)
# throws invalid syntax at *cols_transformed1

The when part works when put inside a for loop, like:

for col_name in column_list:
    df = df.withColumn(col_name + "_new_name",
        when(...).otherwise(
            when(...).otherwise(when(...).otherwise(...)))
    )

But it does not work when plugged into the example at the top. I get a syntax error at '*' on the *[when(...)...] line. I have tried various combinations, including removing the *, but none have worked.

Is it possible to have this kind of complex when logic using the example at the top? I'm not an expert in Python and am struggling to get this working.

Update: Looks like the unpacking operator * doesn't work because I'm using Python 2.7? What would be a workaround if that is the case?
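
For reference, a minimal illustration of the limitation as I understand it (using a toy function rather than Spark): multiple * unpackings in one call need Python 3.5+ (PEP 448), while Python 2.7 only accepts a single * unpacking per call.

xs = [1, 2]
ys = [3, 4]

def f(*args):
    return args

# f(*xs, *ys)    # SyntaxError on Python 2.7; multiple unpackings need Python 3.5+
f(*(xs + ys))    # works on Python 2.7: concatenate first, then one * unpacking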


1 Answer


Instead of using the unpacking operator *, pass a list to the select method. Check the API: accepted types are str, Column, or list. The workaround is to create a list with all your expressions:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# keep the original columns as-is
expr_list = []
for c in df.columns:
    expr_list.append(F.col(c))

# one CASE WHEN expression per column
when_list = []
for c in df.columns:
    when_list.append(F.when(F.col(c) < 2, F.lit(0)).otherwise(F.lit(None)))

# select accepts a single list, so no * unpacking is needed
df.select(expr_list + when_list).show()
+---+----+---------------------------------------+--------------------------------------+
| id|   v|CASE WHEN (id < 2) THEN 0 ELSE NULL END|CASE WHEN (v < 2) THEN 0 ELSE NULL END|
+---+----+---------------------------------------+--------------------------------------+
|  1| 1.0|                                      0|                                     0|
|  1| 2.0|                                      0|                                  null|
|  2| 3.0|                                   null|                                  null|
|  2| 5.0|                                   null|                                  null|
|  2|10.0|                                   null|                                  null|
+---+----+---------------------------------------+--------------------------------------+
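
Applied to your case, the same idea would look roughly like the sketch below. The conditions inside F.when are placeholders, since your real logic isn't shown, and it assumes column_list and a result column already exist in df; the structure mirrors your attempt.

# placeholder conditions; substitute your real when/otherwise logic
cols_to_keep = [F.col(c) for c in df.columns if c not in column_list]

cols_transformed1 = [
    F.when(F.col(c) > 0, F.lit(1))
     .otherwise(F.when(F.col(c) < 0, F.lit(-1))
                 .otherwise(F.lit(0)))
     .alias(c + "_new_name")
    for c in column_list
]

cols_transformed2 = [
    (F.col("result") + F.abs(F.col(c))).alias("result")  # note the parentheses before .alias
    for c in column_list
]

# pass one concatenated list: no multiple * unpacking, so it also runs on Python 2.7
df = df.select(cols_to_keep + cols_transformed1 + cols_transformed2)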