1

I'm trying to looking for a solution to one problem. I've the following df:

task open_date close_date
t01 05/03/2021 16/03/2021
t02 07/03/2021 13/04/2021
t03 23/03/2021 04/04/2021

I need a new df with the backlog at the end of the month (opened that month and closed after it):

end_of_month backlog
31/03/2021 2

I'm playing with last_day() to get the EOM but I'm not reaching with the query/code to getting the right count. Could you help me? Thanks in advance!!

David
  • 11
  • 2

2 Answers2

0
    df
    .withColumn('open_month_next_month', add_months('open_date', 1))
    .filter(last_day('open_month_next_month') == last_day('close_date'))
    .withColumn('end_of_month', last_day('open_date'))
    .groupBy('end_of_month')
    .agg(F.count(F.lit(1)).alias('backlog'))
NaWeeD
  • 561
  • 5
  • 15
  • Hi NaWeeD, thanks for the answer but I guess that only works if the close date is the next month but could be after 2 or more months. – David Jul 13 '21 at 21:23
  • @David Well I guess you should only change the `==` to `<=` right? – NaWeeD Jul 27 '21 at 19:26
0
from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = [("t01", "05/03/2021", "16/03/2021"),
        ("t02", "07/03/2021", "13/04/2021"),
        ("t03", "23/03/2021", "04/04/2021"),
        ("t04", "07/03/2021", "13/05/2021"),
        ("t05", "23/03/2021", "04/05/2021")]
schema = ["TaskID", "TaskstartDate", "TaskEndDate"]

df = spark.createDataFrame(data, schema)

df = df.withColumn("TaskstartDate",F.to_date(F.col("TaskstartDate"), "dd/MM/yyyy"))\
.withColumn("TaskEndDate",F.to_date(F.col("TaskEndDate"), "dd/MM/yyyy"))

df_grouped = df.withColumn("EndDay_fromStartDate", F.last_day(F.col("TaskstartDate")))\
.withColumn("EndDay_fromEndDate", F.last_day(F.col("TaskEndDate")))\
.filter((F.col("EndDay_fromStartDate") != F.col("EndDay_fromEndDate")) & (F.col("TaskstartDate") < F.col("TaskEndDate")))\
.groupBy('EndDay_fromStartDate')\
.agg(F.count(F.lit(1)).alias('backlog'))

df_grouped.show()
Kris
  • 86
  • 1
  • 4
  • Thanks Kris! I think this is exactly what I need. I'll try it and I'll mark it as the solution if it works properly. – David Jul 16 '21 at 13:24
  • Finally, it isn't working at all. When I've a task which was opened on March an closed on June, it's only appearing in March's backlog and must appear also in April and May. – David Jul 23 '21 at 15:37