I want to join 2 DF but before that I want to compare date to end and start by Id :
Inputs :
data1 = [ ('L1','N020','EY','2004-09-14','2010-12-01'),
('L1','N020','EY','2010-12-01','2015-12-23'),
('L1','N020','EY','2015-12-23','2018-04-20'),
('L2','N020','EY','2003-09-25','2004-09-14')]
data2 = [('N020','EY','2006-09-14'),
('N020','EY','2019-12-01'),
('N020','EY','2012-12-23'),
('N020','EY','2002-09-25')]
columns1 = ["id","Leef","code","start","end"]
columns2 = ["Leef","code","date"]
df1 = spark.createDataFrame(data=data1, schema = columns1)
df2 = spark.createDataFrame(data=data2, schema = columns2)
+---------+----+----+----------+-----------+
|id |Leef|code| start | end |
+---------+----+----+----------+----------+
| L1|N020| EY |2004-09-14|2010-12-01|
| L1|N020| EY |2010-12-01|2015-12-23|
| L1|N020| EY |2015-12-23|2018-04-20|
| L2|N020| EY |2003-09-25|2004-09-14|
+------------+------+----+----------+----------+
+----+----+----------+
|Leef|Code| date |
+----+----+----------+
|N020| EY|2006-09-14|
|N020| EY|2019-12-01|
|N020| EY|2012-12-23|
|N020| EY|2002-09-25|
df = df1.join(
F.broadcast(df2),
["Leef", "code"],
"inner"
).dropDuplicates()
df.show()
The resuls is :
+------+----+------+----------+----------+----------+
|Leef |code|id | start| end| date |
+----+---+---------+----------+----------+----------+
|N020| EY| L1|2015-12-23|2018-04-20|2019-12-01|
|N020| EY| L1|2010-12-01|2015-12-23|2012-12-23|
|N020| EY| L1|2010-12-01|2015-12-23|2006-09-14|
|N020| EY| L1|2015-12-23|2018-04-20|2012-12-23|
|N020| EY| L1|2004-09-14|2010-12-01|2012-12-23|
|N020| EY| L2|2003-09-25|2004-09-14|2002-09-25|
|N020| EY| L1|2015-12-23|2018-04-20|2006-09-14|
|N020| EY| L1|2004-09-14|2010-12-01|2019-12-01|
|N020| EY| L1|2004-09-14|2010-12-01|2006-09-14|
|N020| EY| L1|2004-09-14|2010-12-01|2002-09-25|
|N020| EY| L1|2015-12-23|2018-04-20|2002-09-25|
|N020| EY| L2|2003-09-25|2004-09-14|2012-12-23|
|N020| EY| L1|2010-12-01|2015-12-23|2002-09-25|
|N020| EY| L1|2010-12-01|2015-12-23|2019-12-01|
|N020| EY| L2|2003-09-25|2004-09-14|2019-12-01|
|N020| EY| L2|2003-09-25|2004-09-14|2006-09-14|
+----+----+--------+----------+----------+----------+
I want a join by Leef and Code only when df1.start < df2.date < df1.end the problem takes also the date from df1.id= L2 and compare it to end and start of df1.id= L1 excpected output :
+------+----+------+----------+----------+----------+
| id |code|id | start| end| date |
+----+---+---------+----------+----------+----------+
|N020| EY| L1|2010-12-01|2015-12-23|2012-12-23|
|N020| EY| L1|2004-09-14|2010-12-01|2006-09-14|
+----+----+--------+----------+----------+----------+
THANK YOU