
So I am trying to forward fill a column, with the limit for each fill taken from the value in another column. This is the code I run and the error message I get:

import pandas as pd
import numpy as np

df = pd.DataFrame()

df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]

df['length'] = [0, 0, 2, 0, 0, 0, 0]

print(df)

    NM  length
0  0.0       0
1  0.0       0
2  1.0       2
3  NaN       0
4  NaN       0
5  NaN       0
6  0.0       0

df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])

print(df)

ValueError: Limit must be an integer

The dataframe I want looks like this:

    NM  length
0  0.0       0
1  0.0       0
2  1.0       2
3  1.0       0
4  1.0       0
5  NaN       0
6  0.0       0

Thanks in advance for any help you can provide!

noahdfrey
  • Is there only one sequence of `NaN`s in the column or can there be multiple? If it's only one, you could just set `limit=df['length'].max()`. – fsimonjetz Sep 02 '22 at 20:56
  • No, there can be multiple. The idea is to apply this to a large dataframe with ~50,000 rows – noahdfrey Sep 06 '22 at 19:49

2 Answers


I do not think you want to use `ffill` for this.

Rather, I would recommend filtering to the rows where `length` is greater than 0, then iterating through those rows and writing each row's `NM` value into the following `length` rows.

# iterate over the rows that define a fill window, keeping their original index
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    # write this row's NM value into the next `length` rows (.loc slicing is inclusive)
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']

To break this down:

  1. Get the rows that contain the change information, keeping the original index (that is what the `reset_index()` is for):

    df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')

  2. Iterate through them. I prefer `to_dict` for performance reasons on large datasets; it is a habit.

  3. Set `NM` in the rows that follow each of those rows (up to `length` rows ahead) to that row's `NM` value.
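
Running the loop above on the sample frame from the question should reproduce the requested output; the snippet below is just a sketch that rebuilds the same data to show it end to end:

import pandas as pd
import numpy as np

df = pd.DataFrame({'NM': [0, 0, 1, np.nan, np.nan, np.nan, 0],
                   'length': [0, 0, 2, 0, 0, 0, 0]})

# only row 2 has a non-zero length, so rows 3 and 4 receive its NM value of 1.0
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']

print(df)
#     NM  length
# 0  0.0       0
# 1  0.0       0
# 2  1.0       2
# 3  1.0       0
# 4  1.0       0
# 5  NaN       0
# 6  0.0       0

One thing to be aware of: unlike `ffill`, this assignment overwrites any non-NaN values that happen to fall inside the window, which may or may not be what you want.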

ak_slick

You can group the dataframe on the `length` column before filling. The only issue is that for the first group in your example the limit would be 0, which causes an error, so we make sure it is at least 1 with `max`. This might cause unexpected results if there are NaN values before the first non-zero value in `length`, but from the given data it's not clear whether that can happen.

# make groups
m = df.length.gt(0).cumsum()

# fill the column
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(method="ffill", limit=max(f.length.iloc[0], 1))
).values
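
For illustration, here is a sketch of the whole approach on the question's data; the commented values are what I would expect it to print:

import pandas as pd
import numpy as np

df = pd.DataFrame({'NM': [0, 0, 1, np.nan, np.nan, np.nan, 0],
                   'length': [0, 0, 2, 0, 0, 0, 0]})

# each non-zero length starts a new group
m = df.length.gt(0).cumsum()
print(m.tolist())  # [0, 0, 1, 1, 1, 1, 1]

# within each group, forward fill at most `length` rows (at least 1 to avoid the error)
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(method="ffill", limit=max(f.length.iloc[0], 1))
).values

print(df)
#     NM  length
# 0  0.0       0
# 1  0.0       0
# 2  1.0       2
# 3  1.0       0
# 4  1.0       0
# 5  NaN       0
# 6  0.0       0

On recent pandas versions `fillna(method="ffill")` emits a deprecation warning; `f.NM.ffill(limit=...)` is the equivalent spelling.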
fsimonjetz