0

I want to track intervals of consecutive days where certain conditions are met over a timeframe where each User may have multiple occurrences (potential gaps in between) and each User may be missing some Date entries as well. I will show my process so far up to where I am stuck, but if there is a different/easier way to do the end goal, I am not attached to my beginning steps at all.

(The logic I am following is if the day before a missing entry is flagged and the day after a missing entry is flagged, we want to consider the missing date also flagged and calculate the total consecutive days difference.)

Here is an example DataFrame for what I am trying to do:

User = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
Date = ['4/1/2023', '4/2/2023', '4/3/2023', '4/4/2023', '4/5/2023', '4/1/2023', '4/2/2023', '4/4/2023', '4/5/2023', '4/1/2023', '4/2/2023', '4/3/2023', '4/4/2023', '4/5/2023']
Flag = [0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]

df = pd.DataFrame(data = {'User':User, 'Date': Date, 'Flag': Flag})
df['Date'] = pd.to_datetime(df['Date'])



   User   Date     Flag
0   A   4/1/2023    0
1   A   4/2/2023    1
2   A   4/3/2023    1
3   A   4/4/2023    0
4   A   4/5/2023    0
5   B   4/1/2023    0
6   B   4/2/2023    1
7   B   4/4/2023    1
8   B   4/5/2023    1
9   C   4/1/2023    1
10  C   4/2/2023    1
11  C   4/3/2023    0
12  C   4/4/2023    1
13  C   4/5/2023    1

So I used the following coding to group by User and then sum the total for each User and fill in for the following results creating the Tally column:

df['Tally'] = 0

g = df['Flag'].ne(df['Flag'].shift()).cumsum()
counts = df.groupby(['User', g])['Flag'].transform('size')
df['Tally'] = np.where(df['Flag'].eq(1), counts, 0)


   User   Date    Flag Tally
0   A   2023-04-01  0   0
1   A   2023-04-02  1   2
2   A   2023-04-03  1   2
3   A   2023-04-04  0   0
4   A   2023-04-05  0   0
5   B   2023-04-01  0   0
6   B   2023-04-02  1   3
7   B   2023-04-04  1   3
8   B   2023-04-05  1   3
9   C   2023-04-01  1   2
10  C   2023-04-02  1   2
11  C   2023-04-03  0   0
12  C   2023-04-04  1   2
13  C   2023-04-05  1   2

So this is very close to what I want, but if you look at User B is missing 2023-04-03 date so ideally that one should be 4, 4, 4 instead.

To try and compensate for this, I was looking to have my code go through and find the start and end date of each interval so I can have it do the difference of the dates. I used this:

dates = df.groupby(['User', 'Tally']).agg(Start = ('Date', np.min), End = ('Date', np.max))


User Tally  Start      End
 A   0   2023-04-01  2023-04-05
 A   2   2023-04-02  2023-04-03
 B   0   2023-04-01  2023-04-01
 B   3   2023-04-02  2023-04-05
 C   0   2023-04-03  2023-04-03
 C   2   2023-04-01  2023-04-05

My only issue here is that since User C has TWO intervals that have the same duration, it is counting them both in one "2-day category" and only getting the max and min dates over that entire time period, so it looks like it is a 5 day period duration instead of two 2s. For this case we may be able to logically do the math but for a series of intervals over a much longer timeframe, this is problematic and any intervals in the middle will be lost date-wise (or counted incorrectly if missing dates are included in them). How can I modify my code (or what new things can I try) to have it go through and give me the start and end date of each consecutive interval per User?

Thank you in advance!
PS I tried having it loop and reset if it encounters a 0 value but I am not sure I did it right bc I kept getting errors and it would not return any results. I didn't save my code for those attempts unfortunately.

RiverVal
  • 1
  • 2

0 Answers0