
I need help filling in missing values by adding new rows for this case:

This is just an example, but I have a lot of rows with different IDs.

Input dataframe:

ID FLAG DATE
123 1 01/01/2021
123 0 01/02/2021
123 1 01/03/2021
123 0 01/06/2021
123 0 01/08/2021
777 0 01/01/2021
777 1 01/03/2021

So I have a finite set of dates and I want to take every date up to the last one for each ID (in the example, for ID = 123: 01/01/2021, 01/02/2021, 01/03/2021, ... until 01/08/2021). So basically I could do a cross join with a calendar, but I don't know how I can fill the missing values with a rule or a filter after the cross join.

Expected output (the rows for the previously missing dates are newly generated):

ID FLAG DATE
123 1 01/01/2021
123 0 01/02/2021
123 1 01/03/2021
123 1 01/04/2021
123 1 01/05/2021
123 0 01/06/2021
123 0 01/07/2021
123 0 01/08/2021
777 0 01/01/2021
777 0 01/02/2021
777 1 01/03/2021
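
For illustration only (this sketch is not part of the original question), here is roughly what the "cross join with a calendar" idea described above could look like in PySpark, followed by a forward-fill after the join; the `spark` session, the sample data, and the column names are assumptions:

from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(123, 1, "01/01/2021"), (123, 0, "01/02/2021"), (123, 1, "01/03/2021"),
     (123, 0, "01/06/2021"), (123, 0, "01/08/2021"),
     (777, 0, "01/01/2021"), (777, 1, "01/03/2021")],
    ["ID", "FLAG", "DATE"]
).withColumn("DATE", F.to_date("DATE", "dd/MM/yyyy"))

# calendar of every first-of-month date seen anywhere in the data
bounds = df.agg(F.min("DATE").alias("min_d"), F.max("DATE").alias("max_d"))
calendar = bounds.select(
    F.explode(F.sequence("min_d", "max_d", F.expr("interval 1 month"))).alias("DATE"))

# cross join, keep only calendar dates within each ID's own date range,
# then left join the original rows back and forward-fill FLAG
per_id = df.groupBy("ID").agg(F.min("DATE").alias("first_d"), F.max("DATE").alias("last_d"))
w = Window.partitionBy("ID").orderBy("DATE")

result = (per_id.crossJoin(calendar)
          .where(F.col("DATE").between(F.col("first_d"), F.col("last_d")))
          .select("ID", "DATE")
          .join(df, ["ID", "DATE"], "left")
          .withColumn("FLAG", F.last("FLAG", ignorenulls=True).over(w))
          .withColumn("DATE", F.date_format("DATE", "dd/MM/yyyy")))
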
  • Can you elaborate on the logic behind the date for the new row being `04/04/2021` for the new row created after `03/03/2021`? Also, are the dates a finite set (01/01/2021, 02/02/2021, 03/04/2021 .... 11/11/2021, 12/12/2021)? – Nithish Dec 24 '21 at 06:37
  • Yes, I wrote these dates just to give an example... but I have all the months with the first day of the month (I rectified it in the question), so I have a finite set of dates and I want to take every date up to the last one for that ID (in the example, for ID 123: 01/01/2021, 01/02/2021, 01/03/2021... until 01/08/2021). So basically I could do a cross join with a calendar, but I don't know how I can fill the missing values with a rule or a filter after the cross join. – Jresearcher Dec 24 '21 at 10:42

2 Answers


You can first group by id to calculate the max and min dates, then use the sequence function to generate all the dates from min_date to max_date. Finally, join with the original dataframe and fill the nulls with the last non-null value per group of id. Here's a complete working example:

Your input dataframe:

from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (123, 1, "01/01/2021"), (123, 0, "01/02/2021"),
    (123, 1, "01/03/2021"), (123, 0, "01/06/2021"),
    (123, 0, "01/08/2021"), (777, 0, "01/01/2021"),
    (777, 1, "01/03/2021")
], ["id", "flag", "date"])

Group by id and generate all possible dates for each id:

all_dates_df = df.groupBy("id").agg(
    # truncate to the first day of the month so the monthly sequence stays aligned
    F.date_trunc("mm", F.max(F.to_date("date", "dd/MM/yyyy"))).alias("max_date"),
    F.date_trunc("mm", F.min(F.to_date("date", "dd/MM/yyyy"))).alias("min_date")
).select(
    "id",
    # one element per month from min_date to max_date (inclusive)
    F.expr("sequence(min_date, max_date, interval 1 month)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date",
    F.date_format("date", "dd/MM/yyyy")  # back to the original string format
)

Now, left join with df and use the last function over a Window partitioned by id to fill the null values:

w = Window.partitionBy("id").orderBy("date")

result = all_dates_df.join(df, ["id", "date"], "left").select(
    "id",
    "date",
    # forward-fill: for every other column, take the last non-null value seen so far
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
      for c in df.columns if c not in ("id", "date")]
)

result.show()
#+---+----------+----+
#| id|      date|flag|
#+---+----------+----+
#|123|01/01/2021|   1|
#|123|01/02/2021|   0|
#|123|01/03/2021|   1|
#|123|01/04/2021|   1|
#|123|01/05/2021|   1|
#|123|01/06/2021|   0|
#|123|01/07/2021|   0|
#|123|01/08/2021|   0|
#|777|01/01/2021|   0|
#|777|01/02/2021|   0|
#|777|01/03/2021|   1|
#+---+----------+----+
blackbishop
  • This is amazing!! How did I not think of it... maybe I didn't know the `last` function! :) Just a curiosity: how could I paste the entire previous row instead of just the `flag` field (in case I also have other variables)? Thank you very much, you have been a great help! Merry Christmas! – Jresearcher Dec 24 '21 at 12:19
  • Can you write the entire line of code please? – Jresearcher Dec 24 '21 at 12:50
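
As a hedged sketch in reply to the two comments above (this is not code from the answer itself): the list comprehension in the `select` already forward-fills every column other than `id` and `date`, so carrying the whole row only requires the extra columns to be present in the input. Here `category` is a made-up extra column, and `spark` and `all_dates_df` are reused from the answer:

from pyspark.sql import Window
import pyspark.sql.functions as F

# same sample data as above, plus a hypothetical extra column "category"
df2 = spark.createDataFrame([
    (123, 1, "01/01/2021", "A"), (123, 0, "01/02/2021", "B"),
    (123, 1, "01/03/2021", "A"), (123, 0, "01/06/2021", "C"),
    (123, 0, "01/08/2021", "C"), (777, 0, "01/01/2021", "A"),
    (777, 1, "01/03/2021", "B")
], ["id", "flag", "date", "category"])

w = Window.partitionBy("id").orderBy("date")

# every non-key column (flag AND category) is forward-filled the same way
result = all_dates_df.join(df2, ["id", "date"], "left").select(
    "id",
    "date",
    *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
      for c in df2.columns if c not in ("id", "date")]
)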

You can find the range of dates between the DATE value in the current row and the following row, then use sequence to generate all the intermediate dates, and explode this array to fill in rows for the missing dates.

from pyspark.sql import functions as F
from pyspark.sql import Window

data = [(123, 1, "01/01/2021",),
        (123, 0, "01/02/2021",),
        (123, 1, "01/03/2021",),
        (123, 0, "01/06/2021",),
        (123, 0, "01/08/2021",),
        (777, 0, "01/01/2021",),
        (777, 1, "01/03/2021",), ]

df = spark.createDataFrame(data, ("ID", "FLAG", "DATE",)).withColumn("DATE", F.to_date(F.col("DATE"), "dd/MM/yyyy"))

window_spec = Window.partitionBy("ID").orderBy("DATE")

# date of the next row for the same ID; for the last row, pretend there is one a month later
next_date = F.coalesce(F.lead("DATE", 1).over(window_spec), F.col("DATE") + F.expr("interval 1 month"))

# last date the current row has to cover: one month before the next row's date
end_date_range = next_date - F.expr("interval 1 month")

df.withColumn("Ranges", F.sequence(F.col("DATE"), end_date_range, F.expr("interval 1 month")))\
  .withColumn("DATE", F.explode("Ranges"))\
  .withColumn("DATE", F.date_format("DATE", "dd/MM/yyyy"))\
  .drop("Ranges").show(truncate=False)

Output

+---+----+----------+
|ID |FLAG|DATE      |
+---+----+----------+
|123|1   |01/01/2021|
|123|0   |01/02/2021|
|123|1   |01/03/2021|
|123|1   |01/04/2021|
|123|1   |01/05/2021|
|123|0   |01/06/2021|
|123|0   |01/07/2021|
|123|0   |01/08/2021|
|777|0   |01/01/2021|
|777|0   |01/02/2021|
|777|1   |01/03/2021|
+---+----+----------+
Nithish
  • @blackbishop that's weird, I ran the code and copied the results I got as-is, on `Spark v3.2.0` – Nithish Dec 24 '21 at 12:14
  • Thank you Nithish! I can also try your code ASAP! You have been a great help! – Jresearcher Dec 24 '21 at 12:20
  • I confirm, setting the spark session timezone to `UTC` rather than `Europe/Paris` gives the correct result ;) – blackbishop Dec 24 '21 at 12:24
  • This solution doesn't work for me because "add_month is deprecated"... how can I solve it? – Jresearcher Dec 24 '21 at 12:56
  • @Jresearcher can you elaborate on the error, since [`add_months`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.add_months.html#pyspark.sql.functions.add_months) does not appear to be deprecated. – Nithish Dec 24 '21 at 13:00
  • The add_months function is temporarily disallowed because it silently changes behaviour between Spark 2 and Spark 3: add_months(DATE'2021-02-28', 1) is '2021-03-31' in Spark 2, but '2021-03-28' in Spark 3. As a workaround, add intervals instead, e.g. DATE'2021-02-28' + INTERVAL 1 month. – Jresearcher Dec 24 '21 at 13:12
  • @Jresearcher edited the answer to not use `add_months`. In addition, the difference in behavior is only for month ends, as specified [here](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30) – Nithish Dec 24 '21 at 13:34
  • @Jresearcher can you confirm whether the small dataset in the answer executes without error? If it does, then it would be good to check whether the dates in your input are parsed correctly and also the range of dates you are dealing with. – Nithish Dec 24 '21 at 14:06
  • I have extended my example to the real dataset with many rows. I also need to handle the last days of the month, so: 31/01/2021, 28/02/2021, etc. The problem remains the same as in the question, but with month-end dates. Thank you for your patience – Jresearcher Dec 24 '21 at 14:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/240419/discussion-between-nithish-and-jresearcher). – Nithish Dec 24 '21 at 14:20
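
For reference, a minimal hedged sketch (not from the thread) of the two points raised in the comments above: pinning the session timezone to UTC, and checking month-end arithmetic with `add_months` versus interval addition, as described in the linked migration guide. The `spark` session is assumed:

import pyspark.sql.functions as F

# pin the session timezone so sequence()/date arithmetic is not shifted by the local zone
spark.conf.set("spark.sql.session.timeZone", "UTC")

month_end = spark.createDataFrame([("2021-02-28",)], ["d"]).withColumn("d", F.to_date("d"))

month_end.select(
    F.add_months("d", 1).alias("via_add_months"),                    # 2021-03-28 in Spark 3
    (F.col("d") + F.expr("interval 1 month")).alias("via_interval")  # also 2021-03-28
).show()

Note that with month-end inputs like those in the last comment (31/01/2021, 28/02/2021, ...), interval arithmetic clamps 31/01 + 1 month to 28/02, but 28/02 + 1 month then stays at 28/03, so generated dates can drift off the month end; this is the same Spark 3 behavior described in the migration guide and is worth verifying against the expected output.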