Compare two dataset and get what fields are changed

Question

I am working on a spark using Java, where I will download data from api and compare with mongodb data, while the downloaded json have 15-20 fields but database have 300 fields.

Now my task is to compare the downloaded jsons to mongodb data, and get whatever fields changed with past data.

Sample data set

Downloaded data from API

StudentId,Name,Phone,Email
1,tony,123,a@g.com
2,stark,456,b@g.com
3,spidy,789,c@g.com

Mongodb data

StudentId,Name,Phone,Email,State,City
1,tony,1234,a@g.com,NY,Nowhere
2,stark,456,bg@g.com,NY,Nowhere
3,spidy,789,c@g.com,OH,Nowhere

I can't use the except, because of column length.

Expected output

StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,1234,a@g.com,1234, //phone number only changed 
2,stark,456,b@g.com,,bg@g.com //Email only changed
3,spidy,789,c@g.com,,

noufel13 · Accepted Answer · 2019-08-20T18:42:27.780

Consider your data is in 2 dataframes. We can create temporary views for them, as shown below,

api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")

Next we can use Spark SQL. Here, we join both these views using the StudentId column and then use a case statement on top of them to compute the past phone number and email.

spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()

Output:

+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone|  Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
|        1| tony|  123|a@g.com|      1234|          |
|        2|stark|  456|b@g.com|          |  bg@g.com|
|        3|spidy|  789|c@g.com|          |          |
+---------+-----+-----+-------+----------+----------+

score 0 · Answer 2 · answered Aug 20 '19 at 12:19

Please find the below same source code. Here I am taking the only phone number condition as an example.

val list = List((1,"tony",123,"a@g.com"), (2,"stark",456,"b@g.com") 
       (3,"spidy",789,"c@g.com"))
val df1 = list.toDF("StudentId","Name","Phone","Email")
     .select('StudentId as "StudentId_1", 'Name as "Name_1",'Phone as "Phone_1", 
     'Email as "Email_1")

df1.show()

val list1 = List((1,"tony",1234,"a@g.com","NY","Nowhere"), 
     (2,"stark",456,"bg@g.com", "NY", "Nowhere"),
     (3,"spidy",789,"c@g.com","OH","Nowhere"))

val df2 = list1.toDF("StudentId","Name","Phone","Email","State","City")
      .select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
         'Email as "Email_2", 'State as "State_2", 'City as "City_2")

df2.show()

val df3 = df1.join(df2, df1("StudentId_1") === 
         df2("StudentId_2")).where(df1("Phone_1") =!= df2("Phone_2"))

df3.withColumnRenamed("Phone_1", "Past_Phone").show()

+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
|          1|  tony|    123|a@g.com|
|          2| stark|    456|b@g.com|
|          3| spidy|    789|c@g.com|
+-----------+------+-------+-------+

+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
|          1|  tony|   1234| a@g.com|     NY|Nowhere|
|          2| stark|    456|bg@g.com|     NY|Nowhere|
|          3| spidy|    789| c@g.com|     OH|Nowhere|
+-----------+------+-------+--------+-------+-------+

+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|          1|  tony|       123|a@g.com|          1|  tony|   1234|a@g.com|     NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+

Thank you fro the quick answer but,my expected table should not have state,city.Should be showing only fields which changed if not empty.I advance I don't know which fields from df1 are changing , so it should be dynamic — Ratheri2, Aug 20 '19 at 12:28
Okay. Do we have any specific column which will always exist? — Ravi, Aug 20 '19 at 15:35

score 0 · Answer 3 · answered Aug 21 '19 at 08:45

We have :

df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
|          1|  tony|    123|a@g.com|
|          2| stark|    456|b@g.com|
|          3| spidy|    789|c@g.com|
+-----------+------+-------+-------+

df2.show

+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
|          1|  tony|   1234| a@g.com|     NY|Nowhere|
|          2| stark|    456|bg@g.com|     NY|Nowhere|
|          3| spidy|    789| c@g.com|     OH|Nowhere|
+-----------+------+-------+--------+-------+-------+

After Join :

var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))

Then

var ans = jn.withColumn("Past_Phone", when(jn("Phone_2").notEqual(jn("Phone_1")),jn("Phone_1")).otherwise("")).withColumn("Past_Email", when(jn("Email_2").notEqual(jn("Email_1")),jn("Email_1")).otherwise(""))

Reference : Spark: Add column to dataframe conditionally

ans.select(ans("StudentId_2") as "StudentId",ans("Name_2") as "Name",ans("Phone_2") as "Phone",ans("Email_2") as "Email",ans("Past_Email"),ans("Past_Phone")).show

+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone|   Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
|        1| tony| 1234| a@g.com|          |       123|
|        2|stark|  456|bg@g.com|   b@g.com|          |
|        3|spidy|  789| c@g.com|          |          |
+---------+-----+-----+--------+----------+----------+

Compare two dataset and get what fields are changed

Sample data set

3 Answers3