
I have two dataframes, DF1 and DF2. DF1 is the master and DF2 is the delta. The data from DF2 should be inserted into DF1 or used to update the DF1 data.

Let's say DF1 is of the following format:

id_no  start_date  amount  days
1      2016-01-01  4650    22
2      2016-01-02  3130    45
1      2016-01-03  4456    22
2      2016-01-15  1234    45

DF2 contains the following:

id_no  start_date  amount  days
1      2016-01-01  8650    52
2      2016-01-02  7130    65
1      2016-01-06  3456    20
2      2016-01-20  2345    19
3      2016-02-02  1345    19

I need to combine the two dataframes such that if the "id_no" and "start_date" of a DF2 row match a row in DF1, that DF1 row should be replaced, and if they do not match, the DF2 row should be inserted into DF1. The "id_no" on its own is not unique.

The expected result:

id_no  start_date  amount  days
1      2016-01-01  8650    52
2      2016-01-02  7130    65
1      2016-01-03  4456    22
2      2016-01-15  1234    45
1      2016-01-06  3456    20
2      2016-01-20  2345    19
3      2016-02-02  1345    19

3 Answers


You can join the two data frames on id_no and start_date, and then coalesce the amount and days columns, passing the columns from df2 first so that their values take precedence:

import pyspark.sql.functions as f

df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select('id_no', 'start_date', 
    f.coalesce('b.amount', 'a.amount').alias('amount'), 
    f.coalesce('b.days', 'a.days').alias('days')
).show()

+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
|    1|2016-01-06|  3456|  20|
|    2|2016-01-20|  2345|  19|
|    1|2016-01-03|  4456|  22|
|    3|2016-02-02|  1345|  19|
|    2|2016-01-15|  1234|  45|
|    1|2016-01-01|  8650|  52|
|    2|2016-01-02|  7130|  65|
+-----+----------+------+----+
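
Note that the join keys themselves need no coalesce: passing the key names as a list gives USING-style join semantics, so the result already carries a single merged column for each key.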

If you have many more columns:

cols = ['amount', 'days']

df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select('id_no', 'start_date', 
    *(f.coalesce('b.' + col, 'a.' + col).alias(col) for col in cols)
).show()
+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
|    1|2016-01-06|  3456|  20|
|    2|2016-01-20|  2345|  19|
|    1|2016-01-03|  4456|  22|
|    3|2016-02-02|  1345|  19|
|    2|2016-01-15|  1234|  45|
|    1|2016-01-01|  8650|  52|
|    2|2016-01-02|  7130|  65|
+-----+----------+------+----+
  • thanks for your answer. What if the dataframes contain 25 more columns? Should I use coalesce for all the 25 columns? – navin Aug 24 '18 at 15:59
  • You can do it programmatically by looping through the list of columns, `coalesce` `df2` and `df1`, and use the `*` syntax in `select`. – Psidom Aug 24 '18 at 16:22
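
Expanding on that comment, the column list can be derived from the schema instead of typed out by hand (a sketch, assuming id_no and start_date are the only key columns):

import pyspark.sql.functions as f

keys = ['id_no', 'start_date']
# every non-key column gets the same coalesce treatment
cols = [c for c in df1.columns if c not in keys]

df1.alias('a').join(
    df2.alias('b'), keys, how='outer'
).select(*keys,
    *(f.coalesce('b.' + c, 'a.' + c).alias(c) for c in cols)
).show()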

I'm looking into this myself at the moment. It looks like Spark supports SQL's MERGE INTO (for tables backed by a format that allows row-level updates, such as Delta Lake), which should be a good fit for this task. You would just need to create a new_id that is a concatenation of id_no and start_date:

MERGE INTO df1
USING df2
ON df1.new_id = df2.new_id
WHEN MATCHED THEN
  UPDATE SET df1.amount = df2.amount, df1.days = df2.days
WHEN NOT MATCHED
  THEN INSERT *
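
A minimal PySpark sketch of that idea, assuming the tables are backed by a format that supports MERGE, such as Delta Lake (the table name, view name, and new_id helper column are illustrative, not from the original answer):

from pyspark.sql import functions as f

def with_key(df):
    # composite key, since the MERGE above matches on a single column
    return df.withColumn('new_id', f.concat_ws('|', 'id_no', 'start_date'))

with_key(df1).write.format('delta').mode('overwrite').saveAsTable('master')  # hypothetical table
with_key(df2).createOrReplaceTempView('updates')  # hypothetical view

spark.sql("""
    MERGE INTO master
    USING updates
    ON master.new_id = updates.new_id
    WHEN MATCHED THEN
      UPDATE SET amount = updates.amount, days = updates.days
    WHEN NOT MATCHED THEN
      INSERT *
""")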

union should do it if both dfs have the same structure. df2 is unioned first so that F.first keeps the delta row whenever a key exists in both; note that first after a groupby depends on row order, which Spark does not guarantee after a shuffle.

from pyspark.sql import functions as F
grp_by = {'id_no', 'start_date'}
df = df2.union(df1)
df = df.groupby(*grp_by).agg(*[F.first(c).alias(c) for c in set(df.columns)-grp_by])
df.show()
#     +-----+----------+----+------+
#     |id_no|start_date|days|amount|
#     +-----+----------+----+------+
#     |    1|2016-01-06|  20|  3456|
#     |    2|2016-01-20|  19|  2345|
#     |    1|2016-01-03|  22|  4456|
#     |    3|2016-02-02|  19|  1345|
#     |    2|2016-01-15|  45|  1234|
#     |    1|2016-01-01|  52|  8650|
#     |    2|2016-01-02|  65|  7130|
#     +-----+----------+----+------+