7

I am trying to calculate the # of days between failures. I'd like to know on each day in the series the # of days passed since the last failure where failure = 1. There may be anywhere from 1 to 1500 devices.

For Example, Id like my dataframe to look like this (please pull data from url in the second code block. This is just a short example of a larger dataframe.):

date        device      failure      elapsed    
10/01/2015  S1F0KYCR    1            0           
10/07/2015  S1F0KYCR    1            7           
10/08/2015  S1F0KYCR    0            0           
10/09/2015  S1F0KYCR    0            0           
10/17/2015  S1F0KYCR    1            11          
10/31/2015  S1F0KYCR    0            0           
10/01/2015  S8KLM011    1            0           
10/02/2015  S8KLM011    1            2           
10/07/2015  S8KLM011    0            0
10/09/2015  S8KLM011    0            0
10/11/2015  S8KLM011    0            0
10/21/2015  S8KLM011    1            20 

Sample Code:

Edit: Please pull actual data from code block below. The above sample data is an short example. Thanks.

url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"

df = pd.read_csv(url, encoding = "ISO-8859-1")

df = df.sort_values(by = ['date', 'device'], ascending = True) #Sort by date and device
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d') #format date to datetime

This is where I am running into obstacles. However the new column should contain the # of days since last failure, where failure = 1.

test['date'] = 0
for i in test.index[1:]:
    if not test['failure'][i]:
        test['elapsed'][i] = test['elapsed'][i-1] + 1

I have also tried

fails = df[df.failure==1]
fails.Dates = trues.index #need this because .diff() won't work on the index..
fails.Elapsed = trues.Dates.diff()
Starbucks
  • 1,448
  • 3
  • 21
  • 49

2 Answers2

5

Using pandas.DataFrame.groupby with diff and apply:

import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'])
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)

Output:

         Date    Device  Failure  Elapsed
0  2015-10-01  S1F0KYCR        1      0.0
1  2015-10-07  S1F0KYCR        1      7.0
2  2015-10-08  S1F0KYCR        0      0.0
3  2015-10-09  S1F0KYCR        0      0.0
4  2015-10-17  S1F0KYCR        1     11.0
5  2015-10-31  S1F0KYCR        0      0.0
6  2015-10-01  S8KLM011        1      0.0
7  2015-10-02  S8KLM011        1      2.0
8  2015-10-07  S8KLM011        0      0.0
9  2015-10-09  S8KLM011        0      0.0
10 2015-10-11  S8KLM011        0      0.0
11 2015-10-21  S8KLM011        1     20.0

Update:

Found out the actual data linked in the OP contains No device that has more than two failure cases, making the final result all zeros (i.e. no second failure has ever happened and thus nothing to calculate for elapsed). Using OP's original snippet:

import pandas as pd

url = "http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv"

df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True) 
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d')

Find if any device has more than 1 failure:

df.groupby(['device'])['failure'].sum().gt(1).any()
# False

Which actually confirms that the all zeros in df['elapsed'] is actually a correct answer :)

If you tweak your data a bit, it does yield elapsed just as expected.

df.loc[6879, 'device'] = 'S1F0RRB1'
# Making two occurrence of failure for device S1F0RRB1

s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
df['elapsed'].value_counts()
# 0.0    124493
# 3.0         1
Chris
  • 29,127
  • 3
  • 28
  • 51
  • You can shorten this further and dump apply: `df.groupby(['Device', 'Failure'])['Date'].diff().dt.days.where(df['Failure'].astype(bool)).add(1).fillna(0, downcast='infer')` – cs95 May 21 '19 at 02:34
  • Both of these suggestions create an elapsed column of zeros. No date differences are indicated in the column for when a failure occurs. Its just a column of zeros. – Starbucks May 21 '19 at 02:38
  • @Starbucks forgot to add the to_datetime part. Pleae give another try ;) – Chris May 21 '19 at 02:42
  • @Chris Thanks Chris, but I've already identified it as a datetime variable in the first code block. Did you try pulling in the data in the first snippet? I'm still getting all zeros for the elapsed column. Thank you! – Starbucks May 21 '19 at 02:46
  • `elapsed` gives me a bunch of `0`s, why? – U13-Forward May 21 '19 at 03:02
  • @U9-Forward I've updated some analysis on the dataset. Please correct me if there's any loophole :) – Chris May 21 '19 at 04:35
  • @Starbucks Please check the updated answer and let me know if there's anything I've missed. – Chris May 21 '19 at 04:38
  • @cs95 Thanks for the advice :) I've changed my original answer a bit according to your recommendation – Chris May 21 '19 at 04:50
0

Here is one way

df['elapsed']=df[df.Failure.astype(bool)].groupby('Device').Date.diff().dt.days.add(1)
df.elapsed.fillna(0,inplace=True)
df
Out[225]: 
         Date    Device  Failure  Elapsed  elapsed
0  2015-10-01  S1F0KYCR        1        0      0.0
1  2015-10-07  S1F0KYCR        1        7      7.0
2  2015-10-08  S1F0KYCR        0        0      0.0
3  2015-10-09  S1F0KYCR        0        0      0.0
4  2015-10-17  S1F0KYCR        1       11     11.0
5  2015-10-31  S1F0KYCR        0        0      0.0
6  2015-10-01  S8KLM011        1        0      0.0
7  2015-10-02  S8KLM011        1        2      2.0
8  2015-10-07  S8KLM011        0        0      0.0
9  2015-10-09  S8KLM011        0        0      0.0
10 2015-10-11  S8KLM011        0        0      0.0
11 2015-10-21  S8KLM011        1       20     20.0
BENY
  • 317,841
  • 20
  • 164
  • 234