I am working with the following DataFrame, which contains some NaN values.

import numpy as np
import pandas as pd

# pd.datetime was removed from pandas; build the dates with pd.to_datetime instead
df = pd.DataFrame({'day': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-08', '2020-06-09']),
                   'TradeID':['01','02','03','04','05','06','07','08','09'],
                   'Security': ['GOOGLE', 'GOOGLE', 'APPLE', 'GOOGLE', 'GOOGLE','GOOGLE','GOOGLE','GOOGLE','GOOGLE'],
                   'ID': ['ID001', 'ID001', 'ID001', 'ID001', 'ID001','ID001','ID001','ID001','ID001'],
                   'BSType': ['B', 'S', 'B', 'B', 'B','S','S','S','B'],
                   'Price':[105.901,106.969,np.nan,107.037,107.038,107.136,np.nan,107.25,np.nan],
                   'Quantity':[1000000,-300000,np.nan,7500000,100000,-100000,np.nan,-7800000,np.nan]
                  })

Out[318]: 
         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S      NaN        NaN
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN

My goal is to forward fill the NaN values (fillna with method ffill), but only within the same Security and the same ID, and only for the next 60 days (not the next 60 observations, because there may be more than one observation per day).

Here is what I tried. It does not work: none of my NaN values are replaced.

df=df.groupby(['day',"Security","ID"], as_index=False).fillna(method='ffill',limit=60)

The expected output should look like this (note that only the second pair of NaN values has been filled):

  • The first pair of NaN values should not be filled because it does not belong to the same Security.
  • The second pair of NaN values should be filled with the previous observation.
  • The third pair of NaN values should not be filled because it is outside the 60-day scope.

Out[320]: 
         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S  107.136  -100000.0
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN

So, my question is: is there a way to fill NaN values while limiting the ffill method to a certain period of time?

Thank you very much for your time.

2 Answers

You can group the dataframe on the columns Security and ID, along with an additional grouper on the column day with the frequency set to 60 days, then use ffill to forward fill the values within each 60-day group:

g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())

         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S  107.136  -100000.0
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B      NaN        NaN
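
One caveat worth keeping in mind: a grouper with a fixed frequency cuts the timeline into consecutive 60-day bins rather than measuring 60 days from each observation, so two rows only a few days apart can land in different bins and will then not fill each other. The sketch below illustrates this on hypothetical data. The `demo` frame and the `bin_no` column are made up for illustration, and the bin arithmetic assumes the default anchoring at midnight of the first day in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last two rows are only 3 days apart
demo = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-02-28", "2020-03-02"]),
    "Security": ["GOOGLE"] * 3,
    "ID": ["ID001"] * 3,
    "Price": [105.0, 106.0, np.nan],
})

# Same pattern as above, restricted to the Price column
g = pd.Grouper(key="day", freq="60D")
filled = demo.assign(**demo.groupby(["Security", "ID", g])[["Price"]].ffill())

# Index of the 60-day bin each row falls into, assuming bins start at
# midnight of the first day in the data (the assumed default anchoring)
bin_no = (demo["day"] - demo["day"].min().normalize()) // pd.Timedelta(days=60)

print(demo.assign(bin_no=bin_no, Price_filled=filled["Price"]))
```

If the last row ends up in a different bin from the row before it, its Price stays NaN even though the previous observation is only 3 days older. When a true "at most 60 days since the last observation" rule is needed, the gap has to be measured per row instead of per bin.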
Shubham Sharma
  • Hi, thank you very much for this useful tip. It is working very well. What if I want to limit this command to a single column, i.e. ffill NaN values with this criterion on only one column? I have tried df.assign(**df.groupby(["Security","ID", g])["Quantity"].ffill()) but it is not working. – Guillermo Cambronero Pérez Feb 25 '21 at 10:09
  • @GuillermoCambroneroPérez If you want to forward fill a particular column, you can use `df.assign(**df.groupby(["Security","ID", g])[['Quantity']].ffill())` or `df.assign(**df.groupby(["Security","ID", g])['Quantity'].ffill().to_frame())` – Shubham Sharma Feb 25 '21 at 10:35
  • Hi @ShubhamSharma, I have tried your method but the freq does not seem quite correct. Could you please help look into it? `g = pd.Grouper(key='date', freq='180d')` `df5 = df5.assign(**df5.groupby(['customer_id', g]).ffill())` – Danny Mar 10 '23 at 09:26
  • @ShubhamSharma, for the same ID I have 2 dates within 180 days, but the amount didn't populate to the NaN value. – Danny Mar 10 '23 at 09:35
  • @Danny It's hard to guess without looking at a sample of the dataset and the code you are trying. Can you provide a sample? – Shubham Sharma Mar 10 '23 at 14:05
  • @ShubhamSharma please refer to the sample below. It's basically the same question as this post, but I have many different IDs and dates; some records populate correctly, while others are still missing. – Danny Mar 11 '23 at 07:52

Here's my attempt; I'm not sure how well it scales, though:

# Forward fill within each (Security, ID) group, with no time limit yet
filled_df = df.groupby(["Security","ID"], as_index=False).ffill()
# Days elapsed since the previous row of the same group
diffs = df.groupby(["Security","ID"])["day"].diff().dt.days
df["diffs"] = diffs
# Remember which values were missing before the fill
df["price_isna"] = df["Price"].isna()
df["quantity_isna"] = df["Quantity"].isna()
# Swap the original Price and Quantity for the forward-filled ones
df = df.drop(columns=["Price", "Quantity"]).merge(filled_df, on=["day", "TradeID", "BSType"])

def reverse_fillna(value, value_isna, diffs, time_limit=60):
    # Keep original values, and filled values whose gap is within the limit
    if (value_isna and (diffs <= time_limit)) or (not value_isna):
        return value
    else:
        return np.nan
    
df['Price'] = df.apply(lambda row: reverse_fillna(row['Price'], row['price_isna'], row['diffs']), axis=1)
df['Quantity'] = df.apply(lambda row: reverse_fillna(row['Quantity'], row['quantity_isna'], row['diffs']), axis=1)

# Remove the helper columns
df.drop(columns=["price_isna", "quantity_isna", "diffs"], inplace=True)
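
The same "fill first, then undo fills that are too old" idea can probably be written without the row-wise apply. Below is a minimal sketch under a few assumptions: the function name `ffill_within_days` and its parameters are hypothetical, the index is assumed to be unique, and the gap is measured from the last non-missing observation of the group (not from the previous row, as in the code above), with a gap of exactly 60 days still counting as inside the limit.

```python
import numpy as np
import pandas as pd

def ffill_within_days(df, group_cols, date_col, value_cols, max_days=60):
    """Forward fill value_cols per group, keeping a filled value only if the
    last non-missing observation is at most max_days older (hypothetical helper)."""
    # Stable sort keeps the original order of rows that share a date
    out = df.sort_values(date_col, kind="stable").copy()
    keys = [out[c] for c in group_cols]
    limit = pd.Timedelta(days=max_days)
    for col in value_cols:
        # Date of the most recent non-missing value within each group
        last_seen = out[date_col].where(out[col].notna()).groupby(keys).ffill()
        # Unlimited forward fill within each group
        filled = out.groupby(group_cols)[col].ffill()
        # Comparisons with NaT are False, so rows without a past value stay NaN
        fresh = (out[date_col] - last_seen) <= limit
        out[col] = filled.where(fresh)
    return out.sort_index()  # restore the original row order

df = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04", "2020-01-05",
                           "2020-01-06", "2020-01-07", "2020-01-08", "2020-01-08",
                           "2020-06-09"]),
    "TradeID": ["01", "02", "03", "04", "05", "06", "07", "08", "09"],
    "Security": ["GOOGLE", "GOOGLE", "APPLE"] + ["GOOGLE"] * 6,
    "ID": ["ID001"] * 9,
    "BSType": ["B", "S", "B", "B", "B", "S", "S", "S", "B"],
    "Price": [105.901, 106.969, np.nan, 107.037, 107.038, 107.136, np.nan, 107.25, np.nan],
    "Quantity": [1000000, -300000, np.nan, 7500000, 100000, -100000, np.nan, -7800000, np.nan],
})

result = ffill_within_days(df, ["Security", "ID"], "day", ["Price", "Quantity"])
print(result)
```

Because the gap is computed per row, this does not depend on where a fixed 60-day bin happens to start.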
ap1997