
I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.

month_df

| ID | month_end_date |
| -- | -------------- |
| 1  | 2019-07-31     |
| 1  | 2019-06-30     |
| 2  | 2019-10-31     |

daily_df

| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1  | 2019-07-29 | 1          |
| 1  | 2019-07-30 | 1          |
| 1  | 2019-08-01 | 2          |
| 1  | 2019-08-02 | 2          |
| 1  | 2019-08-03 | 2          |
| 1  | 2019-06-29 | 0          |
| 1  | 2019-06-30 | 0          |
| 2  | 2019-10-30 | 5          |
| 2  | 2019-10-31 | NULL       |
| 2  | 2019-11-01 | 6          |
| 2  | 2019-11-02 | 6          |

I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). For each monthly record, I want to keep only the daily record with the minimum daily_date and a non-null new_status.

This post solves the issue using an OUTER APPLY in SQL Server, but that doesn't seem to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient and works in Spark.
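
For concreteness, here's roughly the logic I have in mind, sketched as a plain non-equi join plus a window to dedupe. The column names and 5-day buffer match the tables above; whether this is an efficient way to express it in PySpark is exactly what I'm unsure about:

```python
from pyspark.sql import functions as F, Window

m = month_df.alias("m")
d = daily_df.alias("d")

# Non-equi ("range") join: daily_date in [month_end_date, month_end_date + 5 days),
# considering only daily rows whose new_status is not null.
# Left join so every monthly record is kept even if no daily row qualifies.
joined = m.join(
    d,
    (F.col("m.ID") == F.col("d.ID"))
    & (F.col("d.daily_date") >= F.col("m.month_end_date"))
    & (F.col("d.daily_date") < F.date_add(F.col("m.month_end_date"), 5))
    & F.col("d.new_status").isNotNull(),
    "left",
)

# For each monthly record, keep only the earliest qualifying daily row
w = Window.partitionBy(F.col("m.ID"), F.col("m.month_end_date")).orderBy(
    F.col("d.daily_date")
)
result = (
    joined.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select("m.ID", "m.month_end_date", "d.daily_date", "d.new_status")
)
```

On the sample data this should give (1, 2019-07-31) → 2019-08-01 / 2, (1, 2019-06-30) → 2019-06-30 / 0, and (2, 2019-10-31) → 2019-11-01 / 6. My worry is that the range condition forces a non-equi join, which I suspect Spark handles much less efficiently than the OUTER APPLY approach.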

  • Does it answer your question [How do you create merge_asof functionality in PySpark?](https://stackoverflow.com/questions/57435858/how-do-you-create-merge-asof-functionality-in-pyspark)? – Mykola Zotko Jan 26 '21 at 20:41
  • That question does seem to be asking the same thing @MykolaZotko. I can't tell immediately whether either answer solves the problem efficiently, but I'll look there if I have need to return to this problem. Thanks! – ChrisSirico Feb 11 '21 at 13:43

0 Answers