I have been trying to join two dataframes using the following list of join key passed as a list and I want to add the functionality to join on a subset of the keys if one of the key value is null
I have been trying to join two dataframes df_1 and df_2.
data1 = [[1,'2018-07-31',215,'a'],
[2,'2018-07-30',None,'b'],
[3,'2017-10-28',201,'c']
]
df_1 = sqlCtx.createDataFrame(data1,
['application_number','application_dt','account_id','var1'])
and
data2 = [[1,'2018-07-31',215,'aaa'],
[2,'2018-07-30',None,'bbb'],
[3,'2017-10-28',201,'ccc']
]
df_2 = sqlCtx.createDataFrame(data2,
['application_number','application_dt','account_id','var2'])
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
+------------------+--------------+----------+----+----+
My concern here is, in the case where account_id is null, the join should still work by comparing other 2 keys.
The required output should be like this:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
+------------------+--------------+----------+----+----+
I have found a similar approach to do so by using the statement:
join_elem = "df_1.application_number ==
df_2.application_number|df_1.application_dt ==
df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) ==
F.coalesce(df_2.account_id,F.lit(0))".split("|")
join_elem_column = [eval(x) for x in join_elem]
But the design consideration do not allow me to use a full join expression and i am stuck with using the list of column names as join-key.
I have been trying to find a way to accommodate this coalesce thing into this list itself but have not found any success so far.