
I have the two data frames below, and I am trying to identify the rows in the second data frame whose values do not match the first. This is part of a migration where I want to see the differences after the source data has been moved to a different destination.

source_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  1.1|  1.2|
|def|  3.0|  3.4|
+---+-----+-----+

dest_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  2.1|  2.2|
|def|  3.0|  3.4|
+---+-----+-----+

I want the output to look something like this:

key: abc,

col:          val11                  val12

difference:  [src:1.1,dst:2.1]       [src:1.2,dst:2.2]

Any solution for this?

cloud_hari

1 Answer

from pyspark.sql import functions as F

source_df = spark.createDataFrame(
    [
        ('abc', '1.1', '1.2'),
        ('def', '3.0', '3.4'),
    ], ['key', 'val11', 'val12']
)

dest_df = spark.createDataFrame(
    [
        ('abc', '2.1', '2.2'),
        ('def', '3.0', '3.4'),
    ], ['key', 'val11', 'val12']
)

report = source_df \
    .join(dest_df, 'key', 'full') \
    .filter((source_df.val11 != dest_df.val11) | (source_df.val12 != dest_df.val12)) \
    .withColumn('difference_val11', F.concat(F.lit('[src:'), source_df.val11, F.lit(',dst:'), dest_df.val11, F.lit(']'))) \
    .withColumn('difference_val12', F.concat(F.lit('[src:'), source_df.val12, F.lit(',dst:'), dest_df.val12, F.lit(']'))) \
    .select('key', 'difference_val11', 'difference_val12')

report.show()

+---+-----------------+-----------------+
|key| difference_val11| difference_val12|
+---+-----------------+-----------------+
|abc|[src:1.1,dst:2.1]|[src:1.2,dst:2.2]|
+---+-----------------+-----------------+

Or, if you want it in exactly that format:

for x in report.select('key', 'difference_val11', 'difference_val12').collect():
    print("key: " + str(x[0]) + ",\n\n" +\
          "col:          val11                 val12\n\n" +\
         "difference:   " + str(x[1]) + "     " + str(x[2]))

Output:

key: abc,

col:          val11                 val12

difference:   [src:1.1,dst:2.1]     [src:1.2,dst:2.2]
Luiz Viola
  • 2,143
  • 1
  • 11
  • 30
  • Thanks @Luiz, this works for me. However, can we use two columns while joining, e.g. source_df\ .join(dest_df, 'key','val', 'full')\, if I need to join on one more column? – cloud_hari Mar 10 '22 at 05:31
  • Yes, you can do df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2)) – Luiz Viola Mar 10 '22 at 06:11
  • Thanks @Luiz. I realized one issue with the above approach: I need to populate the result only when there is a difference, i.e. a difference column should only be created when there is a change. – cloud_hari Mar 10 '22 at 06:45
  • Hi @Luiz, I created a new question for the issue observed with populating only the required columns for the differences identified: https://stackoverflow.com/questions/71421220/pyspark-how-dynamically-populate-new-columns-after-it-is-filtered-with-condition – cloud_hari Mar 11 '22 at 10:05
  • I saw, but I didn't understand. The result already only shows the differences. – Luiz Viola Mar 11 '22 at 10:14
  • I updated the question, but I need to populate the result only when there is a difference; if there is no difference, the additional column should not be created. – cloud_hari Mar 11 '22 at 10:17
  • Hi @Luiz, I have updated the question with the required input. Can you please have a look? – cloud_hari Mar 13 '22 at 06:31