I have a dataframe that looks something like this:
df
Name date season binary
Apple 2016-04-24 MAM 1
Banana 2016-05-01 MAM 1
Berry 2016-05-01 MAM 1
Berry 2017-11-05 OND 1
Orange 2016-11-27 OND 1
Kiwi 2017-03-27 MAM 1
And I have a long date time object of a bunch of dates like this:
df1
date season Week
2016-03-28 MAM 13
2016-04-04 MAM 14
2016-04-11 MAM 15
2016-04-18 MAM 16
2016-04-25 MAM 17
2016-05-02 MAM 18
2016-05-09 MAM 19
2016-05-16 MAM 20
2016-05-23 MAM 21
2016-05-30 MAM 22
2016-06-06 MAM 23
2016-06-13 MAM 24
2016-10-03 OND 40
2016-10-10 OND 41
2016-10-17 OND 42
2016-10-24 OND 43
2016-10-31 OND 44
2016-11-07 OND 45
2016-11-14 OND 46
2016-11-21 OND 47
2017-03-13 MAM 11
2017-03-20 MAM 12
2017-03-27 MAM 13
2017-04-03 MAM 14
2017-04-10 MAM 15
2017-04-17 MAM 16
2017-04-24 MAM 17
2017-05-01 MAM 18
2017-05-08 MAM 19
2017-05-15 MAM 20
2017-05-22 MAM 21
2017-05-29 MAM 22
2017-06-19 MAM 25
2017-06-26 MAM 26
2017-07-03 MAM 27
2017-07-10 MAM 28
2017-07-17 MAM 29
2017-07-24 MAM 30
2017-07-31 MAM 31
2017-08-07 MAM 32
2017-08-14 MAM 33
2017-08-21 MAM 34
2017-08-28 MAM 35
2017-09-04 MAM 36
2017-09-11 MAM 37
2017-09-18 MAM 38
2017-09-25 MAM 39
2017-10-02 OND 40
2017-10-09 OND 41
2017-10-16 OND 42
2017-10-23 OND 43
2017-10-30 OND 44
2017-11-06 OND 45
2017-11-13 OND 46
2017-11-20 OND 47
where Week
is dtype: object, date
is dtype: datetime64[ns], and season
is dtype: object.
What I need is a bit complicated. I need a function or algorithm that recognizes df.date
as the "end date" and df1.date
as the "start date" for certain seasons. In other words, an event for "Apple" in df
started on 2016-04-24 which is indicated by the binary=1
. In a new dataframe, I need to include the weeks before 2016-04-24 but after 2016-03-28 and add a column that says binary=0
because the event had not occurred yet.
The messy thing is that there a four seasons: MAM 2016, OND 2016, MAM 2017, and OND 2017. And the "start date" for each df.date
should only be considered for that season.
As an example, this is what I'm looking to get:
df_result
Name date season binary
Apple 2016-03-28 MAM 0
Apple 2016-04-04 MAM 0
Apple 2016-04-11 MAM 0
Apple 2016-04-18 MAM 0
Apple 2016-04-24 MAM 1
Banana 2016-03-28 MAM 0
Banana 2016-04-04 MAM 0
Banana 2016-04-11 MAM 0
Banana 2016-04-18 MAM 0
Banana 2016-04-25 MAM 0
Banana 2016-05-01 MAM 1
Berry 2016-03-28 MAM 0
Berry 2016-04-04 MAM 0
Berry 2016-04-11 MAM 0
Berry 2016-04-18 MAM 0
Berry 2016-04-25 MAM 0
Berry 2016-05-01 MAM 1
Berry 2017-10-02 OND 0
Berry 2017-10-09 OND 0
Berry 2017-10-16 OND 0
Berry 2017-10-23 OND 0
Berry 2017-10-30 OND 0
Berry 2017-11-05 OND 1
Orange 2016-10-03 OND 0
Orange 2016-10-10 OND 0
Orange 2016-10-17 OND 0
Orange 2016-10-24 OND 0
Orange 2016-10-31 OND 0
Orange 2016-11-07 OND 0
Orange 2016-11-14 OND 0
Orange 2016-11-21 OND 0
Orange 2016-11-27 OND 1
Kiwi 2017-03-13 MAM 0
Kiwi 2017-03-20 MAM 0
Kiwi 2017-03-27 MAM 1
To summarize the seasons in start_end_dates
and make it a little clearer, these are the dates:
- MAM 2016: 2016-03-28 to 2016-06-13
- OND 2016: 2016-10-03 to 2016-11-21
- MAM 2017: 2017-03-13 to 2017-09-25
- OND 2017: 2017-10-02 to 2017-11-20
With some help, I know that this I should reindex df
with something like df2 = df.reindex(start_end_dates)
but this doesn't work out quite right.
Then I should groupby the names and blank fill with "0" the weeks before df.date
. So maybe: df3 = df3.loc[df3.groupby(level=0).binary.bfill().dropna().index].fillna(0).reset_index()
I have seen this stack overflow question but it doesn't exactly answer my question: Reindex pandas DataFrame to fill missing dates
Help greatly appreciated.