Rename the column every time in PySpark if it is coming with different name in some files?

Question

I have to rename the column name every time if column name contains address in it.

For example, for the first file I am receiving columns as ADDRESS1, ADDRESS2, ADDRESS3:

For the next file I am receiving column names as T_ADDRESS1, T_ADDRESS2, TADDRESS3:

As I will write this code only once and pass it in my pipeline, how can I change the column name every time if it contains ADDRESS in it using PySpark? As in screenshots we can see, every time if name contains address in it we have to rename to S_ADDRESS1.

score 1 · Accepted Answer · answered Oct 04 '22 at 08:43

A Python function like this could rename dataframe columns:

import re
def df_rename(df):
    return df.toDF(*[re.sub(r"^(.*?)(?=ADDRESS)", "S_", c) for c in df.columns])

Test input dataframes:

df1 = spark.createDataFrame([], "c1 int, ADDRESS1 int, ADDRESS2 int, ADDRESS3 int")
df1.show()
# +---+--------+--------+--------+
# | c1|ADDRESS1|ADDRESS2|ADDRESS3|
# +---+--------+--------+--------+
# +---+--------+--------+--------+

df2 = spark.createDataFrame([], "c1 int, T_ADDRESS1 int, T_ADDRESS2 int, T_ADDRESS3 int")
df2.show()
# +---+----------+----------+----------+
# | c1|T_ADDRESS1|T_ADDRESS2|T_ADDRESS3|
# +---+----------+----------+----------+
# +---+----------+----------+----------+

Applying the function:

df1 = df_rename(df1)
df1.show()
# +---+----------+----------+----------+
# | c1|S_ADDRESS1|S_ADDRESS2|S_ADDRESS3|
# +---+----------+----------+----------+
# +---+----------+----------+----------+

df2 = df_rename(df2)
df2.show()
# +---+----------+----------+----------+
# | c1|S_ADDRESS1|S_ADDRESS2|S_ADDRESS3|
# +---+----------+----------+----------+
# +---+----------+----------+----------+

Thanks a lot for your response. Your code is working exactly the same i wanted. — DataWorld, Oct 04 '22 at 10:49
suppose column name is coming as S_ADDRESS only in that case what can i use to avoid this in code. by your code it will add S_ again. can you help me with this? — DataWorld, Oct 04 '22 at 13:28
I'm not sure I understand. If you don't want, just don't use this function. Either way the name of the column would stay S_ADDRESS — ZygD, Oct 04 '22 at 13:31

Rename the column every time in PySpark if it is coming with different name in some files?

1 Answers1