2

I am looking to find an efficient way to generate the number of days diff from an initial point in time. The initial point in time the first time (min date) an ID is triggered by the value Y in the Indicator column.

Input

ID   Indicator  Date
111  Y          1/20/2020
111  N          1/23/2020
111  Y          1/26/2020
111  N          1/26/2020
123  N          1/1/2020
123  Y          1/23/2020
123  Y          2/5/2020

Output

ID   Indicator  Date         Daysdiff
111  Y          1/20/2020    0
111  N          1/23/2020    3
111  Y          1/26/2020    6
111  N          1/26/2020    6
123  N          1/1/2020     -22
123  Y          1/23/2020    0 
123  Y          2/5/2020     12
df['Date'] = pd.to_datetime(df['Date'])
df['Daysdiff'] = df.groupby('ID')['Date'].diff().fillna('')

My difficulty is figuring out how to implement the starting point, where day 0 should be when an ID has an Indicator of Y at their earliest point.

Logan
  • 293
  • 3
  • 11

2 Answers2

2

Let's try with groupby().idxmax:

idx = df['Indicator'].eq('Y').groupby(df['ID']).transform('idxmax')
df['Daysdiff'] = df['Date'] - df.loc[idx, 'Date'].values

Output:

    ID Indicator       Date Daysdiff
0  111         Y 2020-01-20   0 days
1  111         N 2020-01-23   3 days
2  111         Y 2020-01-26   6 days
3  111         N 2020-01-26   6 days
4  123         N 2020-01-01 -22 days
5  123         Y 2020-01-23   0 days
6  123         Y 2020-02-05  13 days

Note that since 'Y' > 'N', the following should have worked, but it doesn't on my system:

idx = df['Indicator'].groupby(df['ID']).transform('idxmax')
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Quick question, would you happen to know why I would receive the following error " unsupported operand type(s) for -: 'str' and 'str'"? It occurs when I try to run your second line of code - thanks! – Logan Jan 07 '21 at 05:04
  • 1
    Your date Column is still string. Did you forget `pd.to_datetime` as in your question? – Quang Hoang Jan 07 '21 at 05:06
  • Thank you and sorry...I did forget that! – Logan Jan 07 '21 at 05:08
1

You can also sort, and then get the first date with groupby. Then get the difference:

first = (df.sort_values(['ID', 'Indicator', 'Date'], ascending=[True,False,True])
           .groupby('ID')['Date'].transform('first'))
df['Daysdiff'] = (df['Date'] - first).dt.days
df
Out[1]: 
    ID Indicator       Date  Daysdiff
0  111         Y 2020-01-20         0
1  111         N 2020-01-23         3
2  111         Y 2020-01-26         6
3  111         N 2020-01-26         6
4  123         N 2020-01-01       -22
5  123         Y 2020-01-23         0
6  123         Y 2020-02-05        13
David Erickson
  • 16,433
  • 2
  • 19
  • 35