Pandas - How do recalculate the estimated time of arrival with updated events

Question

I have a dataframe that contains these events:

                       ID           m1          m2          m3   m4
1   xxxx/xxxxx.0183683234   2019-10-28  2019-11-28  2019-11-30  NaT
2   xxxx/xxxxx.0183679721   2019-11-28  2019-11-28         NaT  NaT
4   xxxx/xxxxx.0183888975   2019-11-20  2019-12-10         NaT  NaT

This events are occuring in a temporal sequence. This means that :

m1< m2< m3< m4 <...< mn

The goal is to estimate m3 and m4 before it actually happens.

To do so, I use a masterdata that gives me the duration between m2 and m3 and m3 and m4.

The expected output is:

            xxxxxxxxxxID            m1          m2          m3   m4    M2_M3          M3_M4    m3_estimated   m4_estimated
1   xxxx/xxxxx.0183683234   2019-10-28  2019-11-28  2019-11-30  NaT   2 days         9 days      2019-11-30     2019-12-09
2   xxxx/xxxxx.0183679721   2019-11-28  2019-11-28         NaT  NaT   2 days         6 days      2019-11-30            NaT
4   xxxx/xxxxx.0183888975   2019-11-20  2019-12-10         NaT  NaT   6 days         1 days      2019-12-16            NaT

I want to recalculate everytime m3 and m4 is not null anymore.

Here are the functions I tried, but they are not really working:

def m3_estimated(df):

        if df['m2']!= None:
            return pd.to_datetime(df['m2']) + df['M2_M3']
        else: 
             None

def m4_estimated(df):

        if df['m3'] != None:
            return pd.to_datetime(df['m3']) + df['M3_M4']
        else:
             None

are your datetime columns always datetime, do you have any other datetime cols in your dataframe? — Umar.H, Dec 11 '19 at 09:35
You should add pd.Timedelta column to pd.DateTime. I.e. `pd.to_datetime(df['m2']) + pd.to_timedelta(df['M2_M3'], unit='D')` if the latter is integer, or you should turn it into integer somehow, if it's not. — Oleg O, Dec 11 '19 at 09:36
Have you checked this stack? https://stackoverflow.com/questions/45307933/how-to-calculate-time-between-events-in-a-pandas — powerPixie, Dec 11 '19 at 09:40
indeed true that the timedelta function is works for this. but my problem is to calculate the estimated when the actual is not null — Haalanam, Dec 11 '19 at 09:40
@Datanovice Yes it's always datetime cols and i have other datetime cols in my dataframe# — Haalanam, Dec 11 '19 at 09:45
it contains the columns m2_m3, m3_m4 that i have added in the dataframe output — Haalanam, Dec 11 '19 at 09:50

powerPixie · Answer 1 · 2019-12-11T13:40:35.947

0

Let's make sure that date columns are datetime type:

for col in df.columns:
    if col != 'ID':
         df[col] = pd.to_datetime(df[col], errors='coerce')

I still don't know how masterdata looks like, but I will assume it has the same row length of your original dataframe. That's my masterdata:

master = pd.DataFrame([['xxxx/xxxxx.0183683234','2 days','9 days'],
                       ['xxxx/xxxxx.0183679721','2 days','6 days'],
                       ['xxxx/xxxxx.0183888975','6 days','1 day']],columns=['ID','M2_M3','M3_M4'])


out = master    
out = out.merge(df, on='ID') #it will the expected output

# variables to new columns
m3_estimated = []
m4_estimated = []

# library to add days in a simple way
from datetime import timedelta  


for li,m2_v in zip(out['M2_M3'].dt.days.astype('int16'),out['m2']):
   if m2_v:
        m3_estimated.append(m2_v + timedelta(days=li))
    else:
        m3_estimated.append(None)


for li,m3_v in zip(out['M3_M4'].dt.days.astype('int16'),out['m3']):
    if m3_v:
        m4_estimated.append(m3_v + timedelta(days=li))
    else:
        m4_estimated.append(None)


out['m3_estimated'] = m3_estimated
out['m4_estimated'] = m4_estimated

print(out)
                      ID   M2_M3   M3_M4  ...  m4 m3_estimated m4_estimated
0  xxxx/xxxxx.0183683234  2 days  9 days  ... NaT   2019-11-30   2019-12-09
1  xxxx/xxxxx.0183679721  2 days  6 days  ... NaT   2019-11-30          NaT
2  xxxx/xxxxx.0183888975  6 days   1 day  ... NaT   2019-12-16          NaT

edited Dec 11 '19 at 13:40

answered Dec 11 '19 at 11:14

powerPixie

718
9
20

i used it and had this error AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas coming from this line: for li,m2_v in zip(out['M2_M3'].str.split(' '),out['m2']): – Haalanam Dec 11 '19 at 11:35
import numpy as np – powerPixie Dec 11 '19 at 11:36
It must be a problem with the data itself. I used the data available in your example. It seems it can be a problem with data in the column 'M2_M3'. Without knowing the reason of the conflict is difficult to help. – powerPixie Dec 11 '19 at 11:39
M2_M3 has the datatype: timedelta64[ns] – Haalanam Dec 11 '19 at 11:45
And null values? – powerPixie Dec 11 '19 at 11:46
there is no null values – Haalanam Dec 11 '19 at 11:48
@Haalanam, I've adjust the code and now it works. In my dataframe out['M2_M3'] was string type. I've tested and produces the same output as before. This is how I deal with timedelta64(ns): out['M3_M4'].dt.days.astype('int16') – powerPixie Dec 11 '19 at 13:36
@Haalanam Does it help you? – powerPixie Dec 13 '19 at 07:19

score 0 · Answer 2 · answered Dec 11 '19 at 12:40

A possible solution:

df
            xxxxxxxxxxID         m1         m2         m3  m4  M2_M3  M3_M4
1  xxxx/xxxxx.0183683234 2019-10-28 2019-11-28 2019-11-30 NaT 2 days 9 days
2  xxxx/xxxxx.0183679721 2019-11-28 2019-11-28        NaT NaT 2 days 6 days
4  xxxx/xxxxx.0183888975 2019-11-20 2019-12-10        NaT NaT 6 days 1 days

df.dtypes                                                                                                             
xxxxxxxxxxID             object
m1               datetime64[ns]
m2               datetime64[ns]
m3               datetime64[ns]
m4               datetime64[ns]
M2_M3           timedelta64[ns]
M3_M4           timedelta64[ns]
dtype: object

#This two lines can be put in a timeloop:
df["m3_estimated"]=df.m3.where(~df.m3.isna(), df.m2.add(df.M2_M3))
df["m4_estimated"]=df.m4.where(~df.m4.isna(), df.m3_estimated.add(df.M3_M4))

df
            xxxxxxxxxxID         m1         m2         m3  m4  M2_M3  M3_M4 m3_estimated m4_estimated
1  xxxx/xxxxx.0183683234 2019-10-28 2019-11-28 2019-11-30 NaT 2 days 9 days   2019-11-30   2019-12-09
2  xxxx/xxxxx.0183679721 2019-11-28 2019-11-28        NaT NaT 2 days 6 days   2019-11-30   2019-12-06
4  xxxx/xxxxx.0183888975 2019-11-20 2019-12-10        NaT NaT 6 days 1 days   2019-12-16   2019-12-17

df.m4.where(...) selects the value of m4 if it is fixed, or calculate it using m3_estimated and M3_M4.

i like this solution and it's working... However i had to add the .dt.normalize() to skip the timestamp value — Haalanam, Dec 12 '19 at 14:56
btw, how can i add the lines in timeloop? what do you mean by that ? — Haalanam, Dec 12 '19 at 14:56
@Haalanam I mean executing periodically a code that contains a data rereading part and those two lines. — kantal, Dec 12 '19 at 17:55

Pandas - How do recalculate the estimated time of arrival with updated events

2 Answers2