I have had a rethink of the issue and have reformulated my question.
I have a dataframe (df
) which has timeseries data for a number of factors. The timeseries for each factor can start on different days which is fine. For some specific days there is missing data (white space) for FactorB and FactorC (in this example 07/01/2017). For FactorB and FactorC with these white-space days I would like to fill the holes with value for that factor from the previous day. For example:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
In df
data is missing for FactorB
and FactorC
on 07/01/2017
. I would like the resulting df
to look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 5.93 5.91
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
I am wondering if I need to specifically change the white space for FactorB and FactorC on the date with the hole in it (in this example 07/01/2017) to NaN before I then apply
df= df.replace('',np.NaN).ffill()
So my intermediate output for the issue would look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 NaN NaN
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
But how would I apply a NaN to only days where I am legitimately missing data (not changing the days before the FactorB and FactorC timeseries started. Also is there a way to do this without specifically calling a date as the holes could be on any date.
I have tried the following but when I check the data the white space is still there and I feel like I'm going no where:
col = ['FactorB', 'FactorC']
df[col] = df[col].ffill()
I've also tried:
df.fillna(method='ffill')
and
df= df.replace('',np.NaN).ffill()