0

I have had a rethink of the issue and have reformulated my question.

I have a dataframe (df) which has timeseries data for a number of factors. The timeseries for each factor can start on different days which is fine. For some specific days there is missing data (white space) for FactorB and FactorC (in this example 07/01/2017). For FactorB and FactorC with these white-space days I would like to fill the holes with value for that factor from the previous day. For example:

            FactorA FactorB FactorC
01/01/2017     5.50    
02/01/2017     5.31    
03/01/2017     5.62     
04/01/2017     5.84    5.62    5.74
05/01/2017     5.95    5.85    5.86
06/01/2017     5.94    5.93    5.91
07/01/2017     5.62     
08/01/2017     6.01    6.20    6.21
09/01/2017     6.12    6.20    3.23

In df data is missing for FactorB and FactorC on 07/01/2017. I would like the resulting df to look like:

           FactorA  FactorB FactorC
01/01/2017    5.50     
02/01/2017    5.31     
03/01/2017    5.62     
04/01/2017    5.84     5.62    5.74
05/01/2017    5.95     5.85    5.86
06/01/2017    5.94     5.93    5.91
07/01/2017    5.62     5.93    5.91
08/01/2017    6.01     6.20    6.21
09/01/2017    6.12     6.20    3.23

I am wondering if I need to specifically change the white space for FactorB and FactorC on the date with the hole in it (in this example 07/01/2017) to NaN before I then apply

df= df.replace('',np.NaN).ffill() 

So my intermediate output for the issue would look like:

            FactorA FactorB FactorC
01/01/2017     5.50    
02/01/2017     5.31    
03/01/2017     5.62     
04/01/2017     5.84    5.62    5.74
05/01/2017     5.95    5.85    5.86
06/01/2017     5.94    5.93    5.91
07/01/2017     5.62     NaN     NaN
08/01/2017     6.01    6.20    6.21
09/01/2017     6.12    6.20    3.23

But how would I apply a NaN to only days where I am legitimately missing data (not changing the days before the FactorB and FactorC timeseries started. Also is there a way to do this without specifically calling a date as the holes could be on any date.

I have tried the following but when I check the data the white space is still there and I feel like I'm going no where:

col = ['FactorB', 'FactorC']
df[col] = df[col].ffill() 

I've also tried:

df.fillna(method='ffill')

and

df= df.replace('',np.NaN).ffill() 
halfer
  • 19,824
  • 17
  • 99
  • 186
Stacey
  • 4,825
  • 17
  • 58
  • 99
  • 1
    Use `df = df.ffill()` – jezrael Jun 25 '18 at 13:25
  • Thanks jezrael I tried your fix but nothing changes (I still have the blanks instead of yesterdays values) – Stacey Jun 25 '18 at 20:05
  • Are data confidental? If not, is possible send it to my email in my profile? Because is seems data related issue. – jezrael Jun 25 '18 at 20:22
  • Does it make a difference if the missing values are blank as oppose to NaN - The missing values are blank in the dataframe – Stacey Jun 25 '18 at 21:37
  • What about `df= df.replace('\s+',np.NaN, regex=True).ffill() ` ? – jezrael Jun 26 '18 at 08:54
  • Or better `df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))` for replace all non numeric to NaNs – jezrael Jun 26 '18 at 08:57
  • Thanks jezrael much appreciated, I'm away from my machine but looking at the code df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')) would this turn any fields for FactorB and FactorC before 04/01/2017 to NaN as well (ths would not work for the problem) – Stacey Jun 26 '18 at 09:23
  • Yes, it is expected behaviour, get NaNs for all non numeric, or not? – jezrael Jun 26 '18 at 09:25
  • Thanks, The timeseries for the factors can start on different days so before the first day of the start of actual values I'd like the timeseries to remain white-space( so before 04/01/2017). Maybe is there a way to check if a white space field has a numeric value for the day before (if so set to NaN)? – Stacey Jun 26 '18 at 09:44
  • I think `ffill` first NaNs values not change, so `df = df.replace('',np.NaN).ffill().fillna('')` should working nice – jezrael Jun 26 '18 at 10:56

1 Answers1

0

If some values are missing and not NaN:

df = df.replace('',np.NaN).ffill()
Joe
  • 12,057
  • 5
  • 39
  • 55