0

I have two data frames 1) Accounts and 2) Customers. The schema of accounts is as:

Name  Id    Telehone     Mob       email
 AR     1     123        1234      test1@gmail.com
 BR     2     213        4123      test2@gmail.com
 CR     3     231        3214      test3@gmail.com
 KR     4     132        1324      test4@gmail.com

Second table Customers as:

Id    Phone   Email
  2     2344    testq@gmail.com
  6     132     testf@gmail.com
  7     64562    test1@gmail.com

I need to join these two dataframes such that Id is matching Id OR Phone is matching Telephone OR Mob Or Email is matching email. In Above case in first row of Customers is matching on ID, Second is matching on phone and third on email. The join Should be left containin all records of Accounts.

SCouto
  • 7,808
  • 5
  • 32
  • 49
upenkas
  • 9
  • 4

3 Answers3

4

Check below code.

scala> accountDF.show(false)
+----+---+---------+----+---------------+
|name|id |telephone|mob |email          |
+----+---+---------+----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|
+----+---+---------+----+---------------+
scala> customerDF.show(false)
+---+-----+---------------+
|id |phone|email          |
+---+-----+---------------+
|2  |2344 |testq@gmail.com|
|6  |132  |testf@gmail.com|
|7  |64562|test1@gmail.com|
+---+-----+---------------+
scala> accountDF.printSchema
root
 |-- name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- telephone: string (nullable = true)
 |-- mob: string (nullable = true)
 |-- email: string (nullable = true)
scala> customerDF.printSchema
root
 |-- id: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)

scala> 

accountDF.join(customerDF,(accountDF("id") === customerDF("id") || (accountDF("telephone") === customerDF("phone") ||accountDF("mob") === customerDF("phone")) || accountDF("email") === customerDF("email")),"left").show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
+----+---+---------+----+---------------+----+-----+---------------+

Srinivas
  • 8,957
  • 2
  • 12
  • 26
  • 1
    I like it. just one correction needed. condition on mob/phone should be corrected as - (dfa("telehone") === dfb("phone") || dfa("mob") === dfb("phone")) – Shantanu Kher Jul 19 '20 at 12:43
  • ohh.. I didn't see that condition.. let me update thank you.. :) – Srinivas Jul 19 '20 at 12:44
  • 1
    since you have posted the solution using dataframe joins, i have tried to meet this requirement using spark sql. pls have a look and advise if any corrections are needed. – Shantanu Kher Jul 19 '20 at 12:44
  • 1
    think it will be good if you rename dfa and dfb to something like `accountDataFrame` and `customerDataFrame` or `accountDF` and `customerDF` – Mikhail Ionkin Jul 19 '20 at 14:42
  • Thank You, I have renamed ```dfa``` to ```accountDF``` & ```dfb``` to ```customerDF``` – Srinivas Jul 19 '20 at 14:47
1

You can easily meet this requirement with spark SQL.

Code to refer -

import org.apache.spark.sql.functions._

val accountdf = sc.parallelize(Seq(("AR",1,123,1234,"test1@gmail.com"),("BR", 2, 213, 4123, "test2@gmail.com"),("CR", 3, 231, 3214, "test3@gmail.com"),("KR", 4, 132, 1324, "test4@gmail.com"))).toDF("name","id","telephone","mob","email")


accountdf.createOrReplaceTempView("account")

val customerdf = sc.parallelize(Seq((2,2344,"testq@gmail.com"),(6,132,"testf@gmail.com"),(7,64562,"test1@gmail.com"))).toDF("id","phone","email")
   
customerdf.createOrReplaceTempView("customer")

sql("select * from account a left join customer c on a.id = c.id or (a.telephone = c.phone or a.mob = c.phone) or a.email = c.email").show(false)

+----+---+---------+----+---------------+----+-----+---------------+
|name|id |telephone|mob |email          |id  |phone|email          |
+----+---+---------+----+---------------+----+-----+---------------+
|BR  |2  |213      |4123|test2@gmail.com|2   |2344 |testq@gmail.com|
|KR  |4  |132      |1324|test4@gmail.com|6   |132  |testf@gmail.com|
|AR  |1  |123      |1234|test1@gmail.com|7   |64562|test1@gmail.com|
|CR  |3  |231      |3214|test3@gmail.com|null|null |null           |
+----+---+---------+----+---------------+----+-----+---------------+
Shantanu Kher
  • 1,014
  • 1
  • 8
  • 14
0
  val sourceDF = Seq(("AR",1,123,1234,"test1@gmail.com"),
    ("BR",2,213,4123,"test2@gmail.com"),
  ("CR",3,231,3214,"test3@gmail.com"),
  ("KR",4,132,1324,"test4@gmail.com")
  ).toDF("Name","Id","Telehone","Mob","email")


  val sourceDF2 = Seq((2,2344,"testq@gmail.com"),
    (6,132,"testf@gmail.com"),
    (7,64562,"test1@gmail.com")
  ).toDF("Id","Phone","Email")

  val joinDF = sourceDF.join(sourceDF2,
    sourceDF.col("Id") === sourceDF2.col("Id") ||
      (sourceDF.col("Telehone") === sourceDF2.col("Phone") ||
      sourceDF.col("Mob") === sourceDF2.col("Phone")) ||
      sourceDF.col("email") === sourceDF2.col("Email")
    ,
  "inner")
  // use "inner" or "left" or ...
mvasyliv
  • 1,214
  • 6
  • 10