
I want to join two DataFrames, but as part of the join I want to compare date against each row's start and end, per id:

Inputs:

data1 = [ ('L1','N020','EY','2004-09-14','2010-12-01'),
      ('L1','N020','EY','2010-12-01','2015-12-23'),
      ('L1','N020','EY','2015-12-23','2018-04-20'),
      ('L2','N020','EY','2003-09-25','2004-09-14')]
data2 = [('N020','EY','2006-09-14'),
         ('N020','EY','2019-12-01'),
         ('N020','EY','2012-12-23'),
         ('N020','EY','2002-09-25')]


columns1 = ["id","Leef","code","start","end"]
columns2 = ["Leef","code","date"]

df1 = spark.createDataFrame(data=data1, schema = columns1)
df2 = spark.createDataFrame(data=data2, schema = columns2)
+---+----+----+----------+----------+
| id|Leef|code|     start|       end|
+---+----+----+----------+----------+
| L1|N020|  EY|2004-09-14|2010-12-01|
| L1|N020|  EY|2010-12-01|2015-12-23|
| L1|N020|  EY|2015-12-23|2018-04-20|
| L2|N020|  EY|2003-09-25|2004-09-14|
+---+----+----+----------+----------+

+----+----+----------+
|Leef|code|      date|
+----+----+----------+
|N020|  EY|2006-09-14|
|N020|  EY|2019-12-01|
|N020|  EY|2012-12-23|
|N020|  EY|2002-09-25|
+----+----+----------+
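
(A side note: start, end, and date are created here as plain strings. For zero-padded yyyy-MM-dd values, string comparison happens to order dates correctly, but casting to DateType is safer. A minimal sketch of that, using the df1/df2 above:)

    from pyspark.sql import functions as F

    # Optional: cast the string columns to DateType. Lexicographic comparison
    # of zero-padded yyyy-MM-dd strings gives the same ordering, but real date
    # comparisons are safer in general.
    df1 = df1.withColumn("start", F.to_date("start")).withColumn("end", F.to_date("end"))
    df2 = df2.withColumn("date", F.to_date("date"))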

from pyspark.sql import functions as F

df = df1.join(
    F.broadcast(df2),
    ["Leef", "code"],
    "inner"
).dropDuplicates()
df.show()

The result is:

+----+----+---+----------+----------+----------+
|Leef|code| id|     start|       end|      date|
+----+----+---+----------+----------+----------+
|N020|  EY| L1|2015-12-23|2018-04-20|2019-12-01|
|N020|  EY| L1|2010-12-01|2015-12-23|2012-12-23|
|N020|  EY| L1|2010-12-01|2015-12-23|2006-09-14|
|N020|  EY| L1|2015-12-23|2018-04-20|2012-12-23|
|N020|  EY| L1|2004-09-14|2010-12-01|2012-12-23|
|N020|  EY| L2|2003-09-25|2004-09-14|2002-09-25|
|N020|  EY| L1|2015-12-23|2018-04-20|2006-09-14|
|N020|  EY| L1|2004-09-14|2010-12-01|2019-12-01|
|N020|  EY| L1|2004-09-14|2010-12-01|2006-09-14|
|N020|  EY| L1|2004-09-14|2010-12-01|2002-09-25|
|N020|  EY| L1|2015-12-23|2018-04-20|2002-09-25|
|N020|  EY| L2|2003-09-25|2004-09-14|2012-12-23|
|N020|  EY| L1|2010-12-01|2015-12-23|2002-09-25|
|N020|  EY| L1|2010-12-01|2015-12-23|2019-12-01|
|N020|  EY| L2|2003-09-25|2004-09-14|2019-12-01|
|N020|  EY| L2|2003-09-25|2004-09-14|2006-09-14|
+----+----+---+----------+----------+----------+

I want the join on Leef and code to keep a row only when df1.start < df2.date < df1.end. The problem is that the plain join compares every df2.date against the start and end of every df1 row, e.g. a date that belongs with df1.id = L2 also gets compared against the start and end of df1.id = L1. Expected output:

+----+----+---+----------+----------+----------+
|Leef|code| id|     start|       end|      date|
+----+----+---+----------+----------+----------+
|N020|  EY| L1|2010-12-01|2015-12-23|2012-12-23|
|N020|  EY| L1|2004-09-14|2010-12-01|2006-09-14|
+----+----+---+----------+----------+----------+

Thank you.

BADS

1 Answer


You can use alias here and join with conditions:

from pyspark.sql import functions as F

# Columns to select from each side, prefixed with the alias so they can be
# referenced unambiguously after the join (skip Leef/code from df1, since
# df2 already contributes them).
c1 = [f"df1.{i}" for i in df1.columns if i not in ['Leef', 'code']]
c2 = [f"df2.{i}" for i in df2.columns]

# Join on Leef and code, keeping only rows where date falls strictly
# between start and end.
cond = [F.col("df1.Leef") == F.col("df2.Leef"),
        F.col("df1.code") == F.col("df2.code"),
        F.col("df2.date") < F.col("df1.end"),
        F.col("df2.date") > F.col("df1.start")]

df2.alias("df2").join(df1.alias("df1"), on=cond).select(c2 + c1).show()

+----+----+----------+---+----------+----------+
|Leef|code|      date| id|     start|       end|
+----+----+----------+---+----------+----------+
|N020|  EY|2006-09-14| L1|2004-09-14|2010-12-01|
|N020|  EY|2012-12-23| L1|2010-12-01|2015-12-23|
+----+----+----------+---+----------+----------+
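
For comparison, here is a minimal equivalent sketch that avoids the string aliases by referencing the DataFrame objects directly (same df1 and df2 as in the question; only Leef and code need qualifying, since the other column names are unique across the two frames):

    # The same non-equi join, written against the DataFrame objects themselves.
    # Only Leef and code appear in both frames, so qualify those explicitly.
    result = df1.join(
        df2,
        on=[df1["Leef"] == df2["Leef"],
            df1["code"] == df2["code"],
            df2["date"] > df1["start"],
            df2["date"] < df1["end"]],
    )
    result.select(df2["Leef"], df2["code"], df2["date"],
                  df1["id"], df1["start"], df1["end"]).show()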
anky
  • This really works, thank you for your help. Can you please add some explanation of the first two or three lines and the last one? Thank you. – BADS Aug 13 '21 at 11:16
  • @BADS Since we are applying join conditions on columns with the same names in df1 and df2, we need aliases here. I am prepending the alias name to form c1 and c2 so that we can use them in the select. For f-strings, refer to: https://stackoverflow.com/questions/50455784/python-and-f-strings-explanation – anky Aug 13 '21 at 12:29
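
For illustration, with the df1 and df2 defined in the question (and the corrected 'code' filter above), the two list comprehensions evaluate to:

    # What the alias-prefixed column lists contain:
    print(c1)  # ['df1.id', 'df1.start', 'df1.end']
    print(c2)  # ['df2.Leef', 'df2.code', 'df2.date']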