
I have two data frames in PySpark. Their schemas are below:

df1 
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]

df2 
DataFrame[serial_number: string, model_name: string, mac_address: string]

Now I want to do a full outer join on these two data frames, using coalesce on the column that is common to both of them.

I have done it like below and got the expected result:

full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code,  f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'), df2.model_name, df2.mac_address)

Now I want to do the above a little differently. Instead of writing all the column names in the select after the join, I want to do something like using * on the data frame. Basically I want something like below:

full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select('df1.*', f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'), df2.model_name, df2.mac_address).drop('serial_number')

I am getting what I want. Is there a better way to do this kind of operation in PySpark?
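As a side note, for the 'df1.*' shorthand to resolve inside select, the data frames usually need to carry explicit string aliases in the query plan. A minimal sketch of that aliased variant, assuming pyspark.sql.functions is imported as f:

    from pyspark.sql import functions as f

    full_df = (df1.alias('df1')
        .join(df2.alias('df2'),
              f.col('df1.serial_number') == f.col('df2.serial_number'),
              'full_outer')
        .select('df1.*',
                f.coalesce(f.col('df1.serial_number'), f.col('df2.serial_number')).alias('serial_number1'),
                'df2.model_name', 'df2.mac_address')
        .drop('serial_number'))  # drops df1's original serial_number, keeping the coalesced one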

Edit:

This is not a duplicate of https://stackoverflow.com/questions/36132322/join-two-data-frames-select-all-columns-from-one-and-some-columns-from-the-othe?rq=1. I am using coalesce in the join statement, and I want to know if there is a way to exclude the column on which I am using the coalesce function.


1 Answer


You can do something like this:

from pyspark.sql import functions as f

(df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(
        # every df1 column except the join key ...
        [df1[c] for c in df1.columns if c != 'serial_number'] +
        # ... plus the coalesced join key under its original name
        [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')]
    ))
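
If you also want df2's model_name and mac_address in the result, as in your original select, the same list-comprehension pattern extends to the second data frame. A minimal sketch, assuming the schemas shown in the question:

(df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(
        [df1[c] for c in df1.columns if c != 'serial_number'] +
        [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')] +
        # the remaining df2 columns, excluding its copy of the join key
        [df2[c] for c in df2.columns if c != 'serial_number']
    ))

This way neither data frame's serial_number column appears twice, so no trailing drop is needed.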