
I am doing a basic transformation on my PySpark DataFrame, but I am using multiple .withColumn statements:

    from pyspark.sql import functions as F

    def trim_and_lower_col(col_name):
        # Blank values become "unspecified"; everything else is trimmed and lowercased.
        return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))

    df = (
        source_df.withColumn("browser", trim_and_lower_col("browser"))
        .withColumn("browser_type", trim_and_lower_col("browser_type"))
        .withColumn("domains", trim_and_lower_col("domains"))
    )

I read that chaining multiple withColumn calls isn't very efficient and I should use df.select() instead. I tried this:

    cols_to_transform = [
        "browser",
        "browser_type",
        "domains",
    ]

    df = source_df.select(
        [trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns
    )

but it gives me a duplicate column error.

What else can I try?

x89

3 Answers


The duplicate column error occurs because you pass each transformed column twice in that list: once as the newly transformed column (through .alias) and once as the original column (by name in source_df.columns). This solution uses a single select statement, preserves the column order, and avoids the duplication issue:

    df = source_df.select(
        [trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns]
    )

Chaining many .withColumn calls does pose a problem: the unresolved query plan can get pretty large and cause a StackOverflowError on the Spark driver during query plan optimisation. One good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
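
To make that concrete, here is a minimal runnable sketch (the 200-column loop is an arbitrary number picked purely for illustration): each .withColumn call nests another projection into the unresolved plan, while a single select builds one flat projection.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Many chained withColumn calls: plan depth grows with every iteration.
    df = spark.range(10)
    for i in range(200):
        df = df.withColumn(f"c{i}", F.lit(i))

    # Equivalent single select: one flat projection, no deep nesting.
    df2 = spark.range(10).select("id", *[F.lit(i).alias(f"c{i}") for i in range(200)])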

proggeo
  • this gives me an error: "A function object does not have an attribute alias. Please check the spelling and/or the datatype of the object." – x89 Feb 07 '23 at 14:30
  • That is weird, I have just double-checked that this works in Code Authoring in Foundry. Are you sure there's no typo? – proggeo Feb 07 '23 at 14:32

You are naming your new columns with .alias(col), which means they have the same name as the columns you used to create them.

During creation (using .withColumn) this does not pose a problem. But as soon as you try to select, Spark does not know which column to pick.

You could fix it, for example, by giving the new columns a suffix:

    cols_to_transform = [
        "browser",
        "browser_type",
        "domains",
    ]

    df = source_df.select(
        [trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns
    )

Another solution, which does pollute the DAG though, would be:

    cols_to_transform = [
        "browser",
        "browser_type",
        "domains",
    ]

    for col in cols_to_transform:
        source_df = source_df.withColumn(col, trim_and_lower_col(col))
Robert Kossendey
  • is there no way to simply "replace" the column? I don't want to change the name – x89 Feb 07 '23 at 13:00
  • something like a map? – x89 Feb 07 '23 at 13:01
  • I've added an iterative example that is a little more beautiful code-wise. Performance-wise both executions are equally fast, but building the DAG is slower with the .withColumn approach. You could also chain another select statement that renames the columns back, but this would also add a step to the DAG (see the sketch after these comments). – Robert Kossendey Feb 07 '23 at 13:40
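
For reference, a minimal sketch of that rename-back variant mentioned in the comment above (assuming the suffix scheme from this answer and Python 3.9+ for str.removesuffix):

    # Keep the untransformed originals plus the suffixed columns, then strip
    # the suffix in a second select (which adds one more step to the plan).
    tmp = source_df.select(
        [c for c in source_df.columns if c not in cols_to_transform]
        + [trim_and_lower_col(c).alias(f"{c}_new") for c in cols_to_transform]
    )
    df = tmp.select([F.col(c).alias(c.removesuffix("_new")) for c in tmp.columns])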

If you only have these few withColumn calls, keep using them. It's still way more readable and thus way more maintainable and self-explanatory.

If you look into it, you'll see that Spark's documentation says to be careful with withColumn when you have a lot of them, like 200.

Using select also makes your code more error-prone, since it's more complex to read.
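
As a side note, if you happen to be on Spark 3.3 or later (an assumption, since the question doesn't say), there is also DataFrame.withColumns (plural), which applies several expressions in a single projection, so you keep the withColumn style without the long chain:

    # Spark 3.3+ only: withColumns takes a {name: Column} dict and adds all
    # entries in one projection, avoiding a chain of withColumn calls.
    df = source_df.withColumns({c: trim_and_lower_col(c) for c in cols_to_transform})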

Now, if you have many columns, I would define

  • the list of columns to transform,
  • the list of columns to keep,
  • then do the select:
    cols_to_transform = [
        "browser",
        "browser_type",
        "domains",
    ]
    cols_to_keep = [c for c in source_df.columns if c not in cols_to_transform]
    cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]

    df = source_df.select(*cols_to_keep, *cols_transformed)

Note that this moves the transformed columns to the end, so the column order differs from the withColumn version; if you need the original order, build the select list from source_df.columns instead, as in the first answer.

G. C.