I am trying to calculate the # of days between failures
. I'd like to know on each day in the series the # of days passed since the last failure
where failure = 1
. There may be anywhere from 1 to 1500 devices.
For Example, Id like my dataframe to look like this (please pull data from url in the second code block. This is just a short example of a larger dataframe.):
date device failure elapsed
10/01/2015 S1F0KYCR 1 0
10/07/2015 S1F0KYCR 1 7
10/08/2015 S1F0KYCR 0 0
10/09/2015 S1F0KYCR 0 0
10/17/2015 S1F0KYCR 1 11
10/31/2015 S1F0KYCR 0 0
10/01/2015 S8KLM011 1 0
10/02/2015 S8KLM011 1 2
10/07/2015 S8KLM011 0 0
10/09/2015 S8KLM011 0 0
10/11/2015 S8KLM011 0 0
10/21/2015 S8KLM011 1 20
Sample Code:
Edit: Please pull actual data from code block below. The above sample data is an short example. Thanks.
url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True) #Sort by date and device
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d') #format date to datetime
This is where I am running into obstacles. However the new column should contain the # of days since last failure
, where failure = 1
.
test['date'] = 0
for i in test.index[1:]:
if not test['failure'][i]:
test['elapsed'][i] = test['elapsed'][i-1] + 1
I have also tried
fails = df[df.failure==1]
fails.Dates = trues.index #need this because .diff() won't work on the index..
fails.Elapsed = trues.Dates.diff()