
I have two dataframes, DF1 and DF2. DF1 is the master and DF2 is the delta. The data from DF2 should be inserted into DF1 or used to update the DF1 data.

Let's say DF1 is of the following format:

id_no  start_date  amount  days
1      2016-01-01  4650    22
2      2016-01-02  3130    45
1      2016-01-03  4456    22
2      2016-01-15  1234    45

DF2 contains the following:

id_no  start_date  amount  days
1      2016-01-01  8650    52
2      2016-01-02  7130    65
1      2016-01-06  3456    20
2      2016-01-20  2345    19
3      2016-02-02  1345    19

I need to combine the two dataframes such that if the "id_no" and "start_date" of a DF2 row match a row in DF1, that DF1 row should be replaced, and if they do not match, the DF2 row should be inserted into DF1. The "id_no" on its own is not unique.

The expected result:

id_no  start_date  amount  days
1      2016-01-01  8650    52
2      2016-01-02  7130    65
1      2016-01-03  4456    22
2      2016-01-15  1234    45
1      2016-01-06  3456    20
2      2016-01-20  2345    19
3      2016-02-02  1345    19

3 Answers


You can join the two data frames on id_no and start_date, and then coalesce the amount and days columns, passing the columns from df2 first so that their values take precedence:

import pyspark.sql.functions as f

df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select('id_no', 'start_date', 
    f.coalesce('b.amount', 'a.amount').alias('amount'), 
    f.coalesce('b.days', 'a.days').alias('days')
).show()

+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
|    1|2016-01-06|  3456|  20|
|    2|2016-01-20|  2345|  19|
|    1|2016-01-03|  4456|  22|
|    3|2016-02-02|  1345|  19|
|    2|2016-01-15|  1234|  45|
|    1|2016-01-01|  8650|  52|
|    2|2016-01-02|  7130|  65|
+-----+----------+------+----+
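
Note that the join keys themselves need no coalesce: passing the key names as a list gives USING-style join semantics, so the result already carries a single merged column for each key.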

If you have many more columns:

cols = ['amount', 'days']

df1.alias('a').join(
    df2.alias('b'), ['id_no', 'start_date'], how='outer'
).select('id_no', 'start_date', 
    *(f.coalesce('b.' + col, 'a.' + col).alias(col) for col in cols)
).show()
+-----+----------+------+----+
|id_no|start_date|amount|days|
+-----+----------+------+----+
|    1|2016-01-06|  3456|  20|
|    2|2016-01-20|  2345|  19|
|    1|2016-01-03|  4456|  22|
|    3|2016-02-02|  1345|  19|
|    2|2016-01-15|  1234|  45|
|    1|2016-01-01|  8650|  52|
|    2|2016-01-02|  7130|  65|
+-----+----------+------+----+
  • thanks for your answer. What if the dataframes contain 25 more columns? Should I use coalesce for all the 25 columns? – navin Aug 24 '18 at 15:59
  • You can do it programmatically by looping through the list of columns, `coalesce` `df2` and `df1`, and use the `*` syntax in `select`. – Psidom Aug 24 '18 at 16:22
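
Expanding on that comment, the column list can be derived from the schema instead of typed out by hand (a sketch, assuming id_no and start_date are the only key columns):

import pyspark.sql.functions as f

keys = ['id_no', 'start_date']
# every non-key column gets the same coalesce treatment
cols = [c for c in df1.columns if c not in keys]

df1.alias('a').join(
    df2.alias('b'), keys, how='outer'
).select(*keys,
    *(f.coalesce('b.' + c, 'a.' + c).alias(c) for c in cols)
).show()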

I'm looking into this myself at the moment. It looks like Spark supports SQL's MERGE INTO (for tables backed by a format that allows row-level updates, such as Delta Lake), which should be a good fit for this task. You would just need to create a new_id that is a concatenation of id_no and start_date:

MERGE INTO df1
USING df2
ON df1.new_id = df2.new_id
WHEN MATCHED THEN
  UPDATE SET df1.amount = df2.amount, df1.days = df2.days
WHEN NOT MATCHED
  THEN INSERT *
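
A minimal PySpark sketch of that idea, assuming the tables are backed by a format that supports MERGE, such as Delta Lake (the table name, view name, and new_id helper column are illustrative, not from the original answer):

from pyspark.sql import functions as f

def with_key(df):
    # composite key, since the MERGE above matches on a single column
    return df.withColumn('new_id', f.concat_ws('|', 'id_no', 'start_date'))

with_key(df1).write.format('delta').mode('overwrite').saveAsTable('master')  # hypothetical table
with_key(df2).createOrReplaceTempView('updates')  # hypothetical view

spark.sql("""
    MERGE INTO master
    USING updates
    ON master.new_id = updates.new_id
    WHEN MATCHED THEN
      UPDATE SET amount = updates.amount, days = updates.days
    WHEN NOT MATCHED THEN
      INSERT *
""")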

union should do it if both dfs have the same structure. df2 is unioned first so that F.first keeps the delta row whenever a key exists in both; note that first after a groupby depends on row order, which Spark does not guarantee after a shuffle.

from pyspark.sql import functions as F
grp_by = {'id_no', 'start_date'}
df = df2.union(df1)
df = df.groupby(*grp_by).agg(*[F.first(c).alias(c) for c in set(df.columns)-grp_by])
df.show()
#     +-----+----------+----+------+
#     |id_no|start_date|days|amount|
#     +-----+----------+----+------+
#     |    1|2016-01-06|  20|  3456|
#     |    2|2016-01-20|  19|  2345|
#     |    1|2016-01-03|  22|  4456|
#     |    3|2016-02-02|  19|  1345|
#     |    2|2016-01-15|  45|  1234|
#     |    1|2016-01-01|  52|  8650|
#     |    2|2016-01-02|  65|  7130|
#     +-----+----------+----+------+