I have two data frames in PySpark. Their schemas are below:
df1
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]
df2
DataFrame[serial_number: string, model_name: string, mac_address: string]
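For reference, here is a minimal way to reproduce frames with these schemas on toy data (the rows and values are made up for illustration only; a DDL schema string is assumed to be supported by your Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy rows; only the schemas matter here.
df1 = spark.createDataFrame(
    [(1, 'a@example.com', 'Austin', 'TX', '78701', 'SN001')],
    'customer_id int, email string, city string, state string, postal_code string, serial_number string')
df2 = spark.createDataFrame(
    [('SN001', 'Model-A', 'AA:BB:CC:DD:EE:FF'),
     ('SN002', 'Model-B', '11:22:33:44:55:66')],
    'serial_number string, model_name string, mac_address string')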
Now I want to do a full outer join on these two data frames, using coalesce on the column common to both of them (serial_number).
I have done it like below and got the expected result.
from pyspark.sql import functions as f

full_df = (df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer')
              .select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code,
                      f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'),
                      df2.model_name, df2.mac_address))
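As a quick check, the result should carry a single serial_number column, populated from whichever side of the join matched:

full_df.printSchema()
# Expected shape (all columns nullable after the full outer join):
# DataFrame[customer_id: int, email: string, city: string, state: string,
#           postal_code: string, serial_number: string, model_name: string, mac_address: string]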
Now I want to do the above a little differently. Instead of writing out all the column names in the select after the join, I want to do something like using * on the data frame. Basically, I want something like below.
full_df = (df1.alias('df1')
              .join(df2.alias('df2'), df1.serial_number == df2.serial_number, 'full_outer')
              .select('df1.*',
                      f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'),
                      df2.model_name, df2.mac_address)
              .drop('serial_number'))
I am getting what I want. Is there a better way to do this kind of operation in PySpark?
Edit:
This is not a duplicate of https://stackoverflow.com/questions/36132322/join-two-data-frames-select-all-columns-from-one-and-some-columns-from-the-othe?rq=1
I am using coalesce in the join statement. I want to know if there is a way to exclude the column on which I am using the coalesce function from the select, so that I don't have to drop it afterwards.
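For what it's worth, one candidate I am looking at is joining on the column name itself (a USING-style join). My understanding, which is an assumption worth verifying, is that Spark then deduplicates the join column and, for a full outer join, fills it with the coalesce of the two sides:

# Joining on the column name instead of an equality expression: the output
# keeps a single serial_number column, which for a full outer join should be
# coalesce(df1.serial_number, df2.serial_number), with no drop needed.
full_df = df1.join(df2, on='serial_number', how='full_outer')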