I want to track intervals of consecutive days where certain conditions are met over a timeframe where each User may have multiple occurrences (potential gaps in between) and each User may be missing some Date entries as well. I will show my process so far up to where I am stuck, but if there is a different/easier way to do the end goal, I am not attached to my beginning steps at all.
(The logic I am following is if the day before a missing entry is flagged and the day after a missing entry is flagged, we want to consider the missing date also flagged and calculate the total consecutive days difference.)
Here is an example DataFrame for what I am trying to do:
User = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
Date = ['4/1/2023', '4/2/2023', '4/3/2023', '4/4/2023', '4/5/2023', '4/1/2023', '4/2/2023', '4/4/2023', '4/5/2023', '4/1/2023', '4/2/2023', '4/3/2023', '4/4/2023', '4/5/2023']
Flag = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
df = pd.DataFrame(data = {'User':User, 'Date': Date, 'Flag': Flag})
df['Date'] = pd.to_datetime(df['Date'])
User Date Flag
0 A 4/1/2023 0
1 A 4/2/2023 1
2 A 4/3/2023 1
3 A 4/4/2023 0
4 A 4/5/2023 0
5 B 4/1/2023 0
6 B 4/2/2023 1
7 B 4/4/2023 1
8 B 4/5/2023 1
9 C 4/1/2023 1
10 C 4/2/2023 1
11 C 4/3/2023 0
12 C 4/4/2023 1
13 C 4/5/2023 1
So I used the following coding to group by User and then sum the total for each User and fill in for the following results creating the Tally column:
df['Tally'] = 0
g = df['Flag'].ne(df['Flag'].shift()).cumsum()
counts = df.groupby(['User', g])['Flag'].transform('size')
df['Tally'] = np.where(df['Flag'].eq(1), counts, 0)
User Date Flag Tally
0 A 2023-04-01 0 0
1 A 2023-04-02 1 2
2 A 2023-04-03 1 2
3 A 2023-04-04 0 0
4 A 2023-04-05 0 0
5 B 2023-04-01 0 0
6 B 2023-04-02 1 3
7 B 2023-04-04 1 3
8 B 2023-04-05 1 3
9 C 2023-04-01 1 2
10 C 2023-04-02 1 2
11 C 2023-04-03 0 0
12 C 2023-04-04 1 2
13 C 2023-04-05 1 2
So this is very close to what I want, but if you look at User B is missing 2023-04-03 date so ideally that one should be 4, 4, 4 instead.
To try and compensate for this, I was looking to have my code go through and find the start and end date of each interval so I can have it do the difference of the dates. I used this:
dates = df.groupby(['User', 'Tally']).agg(Start = ('Date', np.min), End = ('Date', np.max))
User Tally Start End
A 0 2023-04-01 2023-04-05
A 2 2023-04-02 2023-04-03
B 0 2023-04-01 2023-04-01
B 3 2023-04-02 2023-04-05
C 0 2023-04-03 2023-04-03
C 2 2023-04-01 2023-04-05
My only issue here is that since User C has TWO intervals that have the same duration, it is counting them both in one "2-day category" and only getting the max and min dates over that entire time period, so it looks like it is a 5 day period duration instead of two 2s. For this case we may be able to logically do the math but for a series of intervals over a much longer timeframe, this is problematic and any intervals in the middle will be lost date-wise (or counted incorrectly if missing dates are included in them). How can I modify my code (or what new things can I try) to have it go through and give me the start and end date of each consecutive interval per User?
Thank you in advance!
PS I tried having it loop and reset if it encounters a 0 value but I am not sure I did it right bc I kept getting errors and it would not return any results. I didn't save my code for those attempts unfortunately.