0

I have a dataframe that contains these events:

                       ID           m1          m2          m3   m4
1   xxxx/xxxxx.0183683234   2019-10-28  2019-11-28  2019-11-30  NaT
2   xxxx/xxxxx.0183679721   2019-11-28  2019-11-28         NaT  NaT
4   xxxx/xxxxx.0183888975   2019-11-20  2019-12-10         NaT  NaT

This events are occuring in a temporal sequence. This means that :

m1< m2< m3< m4 <...< mn

The goal is to estimate m3 and m4 before it actually happens.

To do so, I use a masterdata that gives me the duration between m2 and m3 and m3 and m4.

The expected output is:

            xxxxxxxxxxID            m1          m2          m3   m4    M2_M3          M3_M4    m3_estimated   m4_estimated
1   xxxx/xxxxx.0183683234   2019-10-28  2019-11-28  2019-11-30  NaT   2 days         9 days      2019-11-30     2019-12-09
2   xxxx/xxxxx.0183679721   2019-11-28  2019-11-28         NaT  NaT   2 days         6 days      2019-11-30            NaT
4   xxxx/xxxxx.0183888975   2019-11-20  2019-12-10         NaT  NaT   6 days         1 days      2019-12-16            NaT

I want to recalculate everytime m3 and m4 is not null anymore.

Here are the functions I tried, but they are not really working:

def m3_estimated(df):

        if df['m2']!= None:
            return pd.to_datetime(df['m2']) + df['M2_M3']
        else: 
             None

def m4_estimated(df):

        if df['m3'] != None:
            return pd.to_datetime(df['m3']) + df['M3_M4']
        else:
             None 
Nick Decroos
  • 153
  • 1
  • 8
Haalanam
  • 89
  • 9
  • are your datetime columns always datetime, do you have any other datetime cols in your dataframe? – Umar.H Dec 11 '19 at 09:35
  • You should add pd.Timedelta column to pd.DateTime. I.e. `pd.to_datetime(df['m2']) + pd.to_timedelta(df['M2_M3'], unit='D')` if the latter is integer, or you should turn it into integer somehow, if it's not. – Oleg O Dec 11 '19 at 09:36
  • Have you checked this stack? https://stackoverflow.com/questions/45307933/how-to-calculate-time-between-events-in-a-pandas – powerPixie Dec 11 '19 at 09:40
  • indeed true that the timedelta function is works for this. but my problem is to calculate the estimated when the actual is not null – Haalanam Dec 11 '19 at 09:40
  • @powerPixie yes i checked but it's not what i want – Haalanam Dec 11 '19 at 09:43
  • @Datanovice Yes it's always datetime cols and i have other datetime cols in my dataframe# – Haalanam Dec 11 '19 at 09:45
  • @Haalanam. Ok. How does masterdata look? – powerPixie Dec 11 '19 at 09:46
  • it contains the columns m2_m3, m3_m4 that i have added in the dataframe output – Haalanam Dec 11 '19 at 09:50

2 Answers2

0

Let's make sure that date columns are datetime type:

for col in df.columns:
    if col != 'ID':
         df[col] = pd.to_datetime(df[col], errors='coerce')

I still don't know how masterdata looks like, but I will assume it has the same row length of your original dataframe. That's my masterdata:

master = pd.DataFrame([['xxxx/xxxxx.0183683234','2 days','9 days'],
                       ['xxxx/xxxxx.0183679721','2 days','6 days'],
                       ['xxxx/xxxxx.0183888975','6 days','1 day']],columns=['ID','M2_M3','M3_M4'])


out = master    
out = out.merge(df, on='ID') #it will the expected output

# variables to new columns
m3_estimated = []
m4_estimated = []

# library to add days in a simple way
from datetime import timedelta  


for li,m2_v in zip(out['M2_M3'].dt.days.astype('int16'),out['m2']):
   if m2_v:
        m3_estimated.append(m2_v + timedelta(days=li))
    else:
        m3_estimated.append(None)


for li,m3_v in zip(out['M3_M4'].dt.days.astype('int16'),out['m3']):
    if m3_v:
        m4_estimated.append(m3_v + timedelta(days=li))
    else:
        m4_estimated.append(None)


out['m3_estimated'] = m3_estimated
out['m4_estimated'] = m4_estimated

print(out)
                      ID   M2_M3   M3_M4  ...  m4 m3_estimated m4_estimated
0  xxxx/xxxxx.0183683234  2 days  9 days  ... NaT   2019-11-30   2019-12-09
1  xxxx/xxxxx.0183679721  2 days  6 days  ... NaT   2019-11-30          NaT
2  xxxx/xxxxx.0183888975  6 days   1 day  ... NaT   2019-12-16          NaT
powerPixie
  • 718
  • 9
  • 20
0

A possible solution:

df
            xxxxxxxxxxID         m1         m2         m3  m4  M2_M3  M3_M4
1  xxxx/xxxxx.0183683234 2019-10-28 2019-11-28 2019-11-30 NaT 2 days 9 days
2  xxxx/xxxxx.0183679721 2019-11-28 2019-11-28        NaT NaT 2 days 6 days
4  xxxx/xxxxx.0183888975 2019-11-20 2019-12-10        NaT NaT 6 days 1 days

df.dtypes                                                                                                             
xxxxxxxxxxID             object
m1               datetime64[ns]
m2               datetime64[ns]
m3               datetime64[ns]
m4               datetime64[ns]
M2_M3           timedelta64[ns]
M3_M4           timedelta64[ns]
dtype: object

#This two lines can be put in a timeloop:
df["m3_estimated"]=df.m3.where(~df.m3.isna(), df.m2.add(df.M2_M3))
df["m4_estimated"]=df.m4.where(~df.m4.isna(), df.m3_estimated.add(df.M3_M4))

df
            xxxxxxxxxxID         m1         m2         m3  m4  M2_M3  M3_M4 m3_estimated m4_estimated
1  xxxx/xxxxx.0183683234 2019-10-28 2019-11-28 2019-11-30 NaT 2 days 9 days   2019-11-30   2019-12-09
2  xxxx/xxxxx.0183679721 2019-11-28 2019-11-28        NaT NaT 2 days 6 days   2019-11-30   2019-12-06
4  xxxx/xxxxx.0183888975 2019-11-20 2019-12-10        NaT NaT 6 days 1 days   2019-12-16   2019-12-17

df.m4.where(...) selects the value of m4 if it is fixed, or calculate it using m3_estimated and M3_M4.

kantal
  • 2,331
  • 2
  • 8
  • 15