
I have two datasets:

DATASET1

+-------+--------------------+  
|     id|                name|  
+-------+--------------------+  
|S703401|      Ryan P Cassidy|  
|S703401|Christopher J Mat...|  
|S703401|      Frank E LaSota|    
|S703401|      Ryan P Cassidy|  
|S703401|Anthony L Locricchio|  
|S703401|         Jason Monte|  
+-------+--------------------+  

DATASET2

+-------+------+  
|     id|   nic|  
+-------+------+    
|S703401|  RC82|  
|S703401|    NA|  
|S703401|   FL3|  
|S703401|  RC82|  
|S703401|    NA|  
|S703401|JM2080|  
+-------+------+

I want to join them on `id` so that the output is:

+-------+--------------------+-----------+  
|     id|                name|       nic |   
+-------+--------------------+-----------+  
|S703401|      Ryan P Cassidy|       RC82|  
|S703401|Christopher J Mat...|         NA|  
|S703401|      Frank E LaSota|        FL3|  
|S703401|      Ryan P Cassidy|       RC82|  
|S703401|Anthony L Locricchio|         NA|  
|S703401|         Jason Monte|     JM2080|  
+-------+--------------------+-----------+  

I am using Java Spark: `Dataset<Row> joined = dataset1.join(dataset2, "id");` but then I get the cartesian product of all rows within each `id` (6 × 6 = 36 rows), like:

+-------+--------------------+------+  
|     id|                name|   nic|  
+-------+--------------------+------+  
|S703401|      Ryan P Cassidy|JM2080|  
|S703401|      Ryan P Cassidy|    NA|  
|S703401|      Ryan P Cassidy|  RC82|  
|S703401|      Ryan P Cassidy|   FL3|  
|S703401|      Ryan P Cassidy|    NA|  
|S703401|      Ryan P Cassidy|  RC82|  
|S703401|Christopher J Mat...|JM2080|  
|S703401|Christopher J Mat...|    NA|  
|S703401|Christopher J Mat...|  RC82|  
|S703401|Christopher J Mat...|   FL3|  
|S703401|Christopher J Mat...|    NA|  
|S703401|Christopher J Mat...|  RC82|  
|S703401|      Frank E LaSota|JM2080|  
|S703401|      Frank E LaSota|    NA|  
|S703401|      Frank E LaSota|  RC82|  
|S703401|      Frank E LaSota|   FL3|  
|S703401|      Frank E LaSota|    NA|  
|S703401|      Frank E LaSota|  RC82|  
|S703401|      Ryan P Cassidy|JM2080|  
|S703401|      Ryan P Cassidy|    NA|  
+-------+--------------------+------+   

So what am I missing here?

  • You have multiple rows with the same `id` in both datasets, so a many-to-many relation is expected. Do you have any logic by which you can decide your desired output? – Nikhil Suthar Feb 19 '20 at 06:06
  • I want to add the `nic` values to dataset1 where dataset1.id = dataset2.id, so that I get only 6 output rows. – zzz Feb 19 '20 at 06:16

1 Answer


Do not use the `monotonically_increasing_id` function to generate a row ID: it produces arbitrary, non-consecutive numbers, which will break your join logic. Use the `row_number` window function instead, which assigns a correct sequential row ID within each group in all cases.

scala> import org.apache.spark.sql.expressions.Window

scala> import org.apache.spark.sql.functions._

scala> df.show()
+-------+------------------+
|     id|              name|
+-------+------------------+
|S703401|      RyanPCassidy|
|S703401|ChristopherJMat...|
|S703401|      FrankELaSota|
|S703401|      RyanPCassidy|
|S703401|AnthonyLLocricchio|
|S703401|        JasonMonte|
+-------+------------------+


scala> df1.show()
+-------+------+
|     id|   nic|
+-------+------+
|S703401|  RC82|
|S703401|    NA|
|S703401|   FL3|
|S703401|  RC82|
|S703401|    NA|
|S703401|JM2080|
+-------+------+


scala> val w = Window.partitionBy(col("id")).orderBy(lit(1))

scala> df.withColumn("rw", row_number.over(w))
         .join(df1.withColumn("rw", row_number.over(w)), List("id", "rw"),"left")
         .drop("rw")
         .show(false)
+-------+------------------+------+
|id     |name              |nic   |
+-------+------------------+------+
|S703401|RyanPCassidy      |RC82  |
|S703401|ChristopherJMat...|NA    |
|S703401|FrankELaSota      |FL3   |
|S703401|RyanPCassidy      |RC82  |
|S703401|AnthonyLLocricchio|NA    |
|S703401|JasonMonte        |JM2080|
+-------+------------------+------+
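Since the question uses the Java API, the same fix translates roughly as follows. This is a sketch, assuming `dataset1` (id, name) and `dataset2` (id, nic) are already loaded as `Dataset<Row>`; note that `orderBy(lit(1))` does not define a real ordering, so the row pairing is not guaranteed to be deterministic unless each dataset has a column you can actually sort on.

```java
import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import scala.collection.JavaConverters;

// Number each row within its id group.
WindowSpec w = Window.partitionBy(col("id")).orderBy(lit(1));

// Add the same positional row number to both sides, then join on (id, rw)
// so each name row pairs with exactly one nic row.
Dataset<Row> joined = dataset1
    .withColumn("rw", row_number().over(w))
    .join(dataset2.withColumn("rw", row_number().over(w)),
          JavaConverters.asScalaBuffer(Arrays.asList("id", "rw")).toSeq(),
          "left")
    .drop("rw");

joined.show(false);
```

The `JavaConverters` call is only needed because this `join` overload takes a Scala `Seq` of column names; joining on both `id` and `rw` is what collapses the 36-row cartesian product back to 6 rows.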