
I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?

vanhooser

1 Answer


We've noticed in practice that calling withColumn inside a for/while loop leads to poor query planning performance, as discussed over here. This is not obvious when you first write code in Foundry, so we've built a feature to warn you about it.

We recommend following the advice in the Scala docs:

withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.

Since: 2.0.0

Note: this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.

For example,

from pyspark.sql import functions as F

my_other_columns = [...]

# Keep every existing column and add all the suffixed copies in one projection
df = df.select(
  "*",
  *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns]
)

is vastly preferred over

my_other_columns = [...]

for col_name in my_other_columns:
  df = df.withColumn(
    col_name + "_suffix",
    F.col(col_name)
  )

While calling withColumn in a loop is technically a normal use of the PySpark API, it results in poor query planning performance once withColumn is called too many times in your job, so we'd prefer you avoid the problem entirely.
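The select pattern is not limited to plain F.col references: select accepts any Column expression, including a when/otherwise chain, as long as each one is aliased to its output name. Below is a minimal sketch of a reusable helper under that assumption. The names add_columns, demo, new_columns, and the sample data are hypothetical and only for illustration; they are not part of PySpark or Foundry. Unlike withColumn, this sketch refuses to overwrite an existing column rather than replacing it.

```python
def add_columns(df, new_columns):
    """Add every entry of `new_columns` to `df` with a single select.

    `new_columns` maps an output column name to a PySpark Column expression.
    This is a hypothetical helper, not a PySpark API.
    """
    clashes = [name for name in new_columns if name in df.columns]
    if clashes:
        # withColumn would replace these silently; this sketch raises instead.
        raise ValueError(f"columns already exist: {clashes}")
    # One select adds one projection to the plan, however many columns are added.
    return df.select(
        "*",
        *[expr.alias(name) for name, expr in new_columns.items()],
    )


def demo():
    # Imported here so the helper above has no import-time dependency on Spark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Made-up sample data, purely for illustration.
    df = spark.createDataFrame([(1, 10.0), (2, -5.0)], ["id", "amount"])
    return add_columns(
        df,
        {
            "amount_suffix": F.col("amount"),
            # A when/otherwise chain is already a Column, so it needs no F.col wrapper.
            "sign": F.when(F.col("amount") >= 0, "positive").otherwise("negative"),
        },
    )
```

Calling demo().show() in an environment with Spark available should display the two original columns followed by the two new ones. Note that the * unpacking has to sit inside the select(...) call (or another call or list literal); written on its own, *[...] is a syntax error.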

vanhooser
  • I have been following all your posts and really appreciate the effort. If you can make the time, please post more of these Foundry tips and tricks. – Asher Jan 15 '22 at 07:36
  • Happy to post them! I've posted / answered a bunch more today, feel free to check them out, they are all under the [palantir-foundry] tag – vanhooser Jan 20 '22 at 21:06
  • What if we had a complex when condition as a value? Where would that go, since it cannot be put inside col( )? When I tried *[when....otherwise(....)], it said invalid syntax at the *. Any idea? – user2441441 Aug 31 '23 at 22:04