0

So i have two data frame . Data Frame 1 like this :

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        4|       1|  I|!||
|4295877346|   136|        4|       1|  I|!||
|4295877341|   138|        2|       1|  I|!||
|4295877341|   141|        4|       1|  I|!||
|4295877341|   143|        2|       1|  I|!||
|4295877341|   145|       14|       1|  I|!||
| 123456789|   145|       14|       1|  I|!||
| 809580109|   145|       9|        9|  I|!||
+----------+------+---------+--------+------+

DataFrame2 is like below

+----------+------+-----------+----------+--------+
|     OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877343|   149|         15|         2|    I|!||
|4295877341|   136|       null|      null|    I|!||
| 123456789|   145|         14|         1|    D|!||
|4295877341|   138|         11|        22|    I|!||
|4295877341|   141|         10|         1|    I|!||
|4295877341|   143|          1|         1|    I|!||
| 809580109|   145|       NULL|      NULL|    I|!||
+----------+------+-----------+----------+--------+

Now i have to join both data frame update data frame 1 column with matching records with data frame 2 .

Now key in both data frame is OrgId and ItemId.

So the expected output should be .

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877346|   136|        4|       1|  I|!||
|4295877341|   145|       14|       1|  I|!||
|4295877343|   149|       15|       2|  I|!||
|4295877341|   136|     null|    null|  I|!||
|4295877341|   138|       11|      22|  I|!||
|4295877341|   141|       10|       1|  I|!||
|4295877341|   143|        1|       1|  I|!||
| 809580109|   145|       9|        9|  I|!||
+----------+------+---------+--------+------+

So i need to update data frame 1 with data frame 2 records . If records in data frame 1 is not found in 2 then also we need to retain that records . If any new records are found in dataframe 2 then that records needs to added in the output

Here is what i am doing ..

val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
  .select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
  .filter(!$"Action_1".contains("D"))
    df3.show()

But i am getting below output .

+----------+------+-----------+----------+--------+
|     OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877343|   149|         15|         2|    I|!||
|4295877341|   136|       null|      null|    I|!||
|4295877341|   138|         11|        22|    I|!||
|4295877341|   141|         10|         1|    I|!||
|4295877341|   143|          1|         1|    I|!||
+----------+------+-----------+----------+--------+

I am not getting 4295877346| 136| 4| 1| I|!| record from data frame 1 ...

left_outer gives me below output

+----------+------+-----------+----------+--------+
|     OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877341|   136|       null|      null|    I|!||
|4295877341|   138|         11|        22|    I|!||
|4295877341|   141|         10|         1|    I|!||
|4295877341|   143|          1|         1|    I|!||
+----------+------+-----------+----------+--------+

2 Answers2

2

Let me explain first whats your mistake.

if you only join as below

val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
df3.show()

You will get

+----------+------+---------+--------+------+-----------+----------+--------+
|     OrgId|ItemId|segmentId|Sequence|Action|segmentId_1|Sequence_1|Action_1|
+----------+------+---------+--------+------+-----------+----------+--------+
|4295877346|   136|        4|       1|  I|!||       null|      null|    null|
|4295877341|   145|       14|       1|  I|!||       null|      null|    null|
|4295877343|   149|     null|    null|  null|         15|         2|    I|!||
|4295877341|   136|        4|       1|  I|!||       null|      null|    I|!||
| 123456789|   145|       14|       1|  I|!||         14|         1|    D|!||
|4295877341|   138|        2|       1|  I|!||         11|        22|    I|!||
|4295877341|   141|        4|       1|  I|!||         10|         1|    I|!||
|4295877341|   143|        2|       1|  I|!||          1|         1|    I|!||
+----------+------+---------+--------+------+-----------+----------+--------+

It is full evident that the filter in your code is filtering the null as well in Action_1 column

So the working code for you is to change the null values that you get after you join to valid data from other table where the data is present.

val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
  .withColumn("segmentId_1", when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId"))
  .withColumn("Sequence_1", when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence"))
  .withColumn("Action_1", when($"Action_1".isNotNull, $"Action_1").otherwise($"Action"))
  .select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
  .filter(!$"Action_1".contains("D") )
df3.show()

you should be getting the desired output as

+----------+------+-----------+----------+--------+
|     OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
+----------+------+-----------+----------+--------+
|4295877346|   136|          4|         1|    I|!||
|4295877341|   145|         14|         1|    I|!||
|4295877343|   149|         15|         2|    I|!||
|4295877341|   136|       null|      null|    I|!||
|4295877341|   138|         11|        22|    I|!||
|4295877341|   141|         10|         1|    I|!||
|4295877341|   143|          1|         1|    I|!||
+----------+------+-----------+----------+--------+
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • There is slight change in the records ...If suppose for one specific columns if we get null value like last records in the DF1 and update for DF1 last records in in DF2 is null then we have to retain the DF1 value only for that columns ..How can we do that ? –  Oct 24 '17 at 09:25
  • sorry to tell you @Anupam, I didn't understand clearly your requirement. Can you explain a little bit more. – Ramesh Maharjan Oct 24 '17 at 09:37
  • So there are some columns which are optional which means when that column value is null in that case we dont have to update with the DF2 column value ..null as string –  Oct 24 '17 at 10:10
  • 1
    modify the when condition according to your needs as i have done in my answer above – Ramesh Maharjan Oct 24 '17 at 11:43
  • Hi...I know it is silly question to ask ..But suppose i have null in the place of primary key ..For example null value in the place of `ItemId` in that case how can i solve this problem ..I know primary key can not be null but there is situation where i will get part of the primary key as null ... –  Nov 07 '17 at 08:11
  • Can i create separate question for this? –  Nov 07 '17 at 08:12
  • which dataframe is it? first or the second? – Ramesh Maharjan Nov 07 '17 at 08:18
  • if you have null then it will join with nulls of other dataframe. There won't be any issue. did you get one? – Ramesh Maharjan Nov 07 '17 at 08:22
  • Oh ok yes i will be getting null so just thought to ask ... I will just try to join with null and then will see ..If you are confirming i think then it will work –  Nov 07 '17 at 08:24
  • if you encounter any issues do let me know or you can ask another question. But confirm it yourself by testing on test data. good luck. :) – Ramesh Maharjan Nov 07 '17 at 08:25
  • Hello sir my colleague created separate question with an example .Please have a look ...https://stackoverflow.com/questions/47269172/get-latest-records-in-a-data-frame-based-on-time-stamp-with-condition –  Nov 14 '17 at 05:00
0

Try left-outer instead of outer:

val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "left_outer")
  .select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
  .filter(!$"Action_1".contains("D"))
    df3.show()

Left outer should retain all non matched in the left.

A nice tutorial here.