1

I am trying to fill in gaps in data by grouping and then using the trend of the previous data points to predict what the missing values are.

df

Group  Week  Value
B      1     5
B      2     6
B      3     NaN
B      4     NaN
B      5     NaN
B      6     8
B      7     8
B      8     7
B      9     6
B      10    NaN

Which graphically looks like this: Initial df plot

Once the desired function has taken place the dataframe will look like the following:

Group  Week  Value
B      1     5
B      2     6
B      3     7
B      4     8
B      5     9
B      6     8
B      7     8
B      8     7
B      9     6
B      10    5.5

The trend of previous points to find these NaN values is shown graphically here: NaN values calculated

The first three NaN values in this example are found by simply plotting the values 5 and 6, finding the linear equation (y = mx + c) and fitting x as the Week to calculate y. This same process would be carried on for all NaN values

I have tried interpolating (df = df.groupby('Group').apply(lambda group: group.interpolate(method='index')) but this obviously looks at the next valid data point and includes it in the calculation, which I am trying to avoid

May be worth noting that the dataframe I am using has 200,000 rows and 4,000 groups!

lrobertd
  • 13
  • 2
  • @lrobertd Why don't you group the data points according to your requirement. – Mankind_008 Jan 02 '21 at 00:10
  • Welcome to StackOverflow! If I have solved your problem, then make sure to accept as solution by clicking the checkmark next to my answer. Thank you! Please see the following link with help on how to do this: https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – David Erickson Jan 02 '21 at 00:42
  • What’s the error? Always include the error message in a comment or add to your question and refer to it in a comment. Thank you! – David Erickson Jan 03 '21 at 02:11
  • @DavidErickson spline is erroring as some of the groups only contain 1 value - which I understand spline cannot operate on. Is there something I could implement along the lines of the following sudo code: if group size = 1 then 'linear' else 'spline' – lrobertd Jan 03 '21 at 21:32
  • Hi @lrobertd See my updated answer. Kindly accept the solution that solved the original problem by clicking the checkmark next to the answer. You can always open a new StackOverflow question if you need more help and reference back to the ois quesiton. – David Erickson Jan 04 '21 at 18:17
  • Please see here on how to accept an answer if you need help: https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – David Erickson Jan 04 '21 at 18:18

1 Answers1

0

You can create subgroups Series g and pass method="spline" and order=1 to interpolate:

g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
df['Value'] = (df.groupby(['Group', g])['Value']
                 .apply(lambda x: x.interpolate(method="spline", order=1)))
df
Out[1]: 
  Group  Week  Value
0     B     1    5.0
1     B     2    6.0
2     B     3    7.0
3     B     4    8.0
4     B     5    9.0
5     B     6    8.0
6     B     7    8.0
7     B     8    7.0
8     B     9    6.0
9     B    10    5.5

The intermediary step to get g looks like this.

g = df['Value'].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
g
Out[1]: 
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
5    3.0
6    3.0
7    3.0
8    3.0
9    3.0

The numbers basically just create subgroups. My method was one way to achieve this.


Per your comment, I create a mask m that counts groups that are of size 1. Then, I combine the separate methods using fillna():

df = pd.DataFrame({'Group': {0: 'A',
  1: 'B',
  2: 'B',
  3: 'B',
  4: 'B',
  5: 'B',
  6: 'B',
  7: 'B',
  8: 'B',
  9: 'B'},
 'Week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'Value': {0: 5.0,
  1: 6.0,
  2: np.nan,
  3: np.nan,
  4: np.nan,
  5: 8.0,
  6: 8.0,
  7: 7.0,
  8: 6.0,
  9: np.nan}})
g = df['Value'].iloc[1:].mask(df['Value'].notnull(), df['Value'].isnull().cumsum()).ffill()
m = df.groupby(['Group', g])['Value'].transform('count') > 1
v1 = (df[m].groupby(['Group', g])['Value']
                     .apply(lambda x: x.interpolate(method="spline", order=1)))
v2 = (df.groupby(['Group', g])['Value']
                     .apply(lambda x: x.interpolate(method="index")))
df['Value'] = df['Value'].fillna(v1).fillna(v2)
df
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • This is very helpful - thank you One questions - why use 'spline' method rather than 'index' here? x.interpolate(method="spline", order=1) I had to use 'index' otherwise the code threw an error. – lrobertd Jan 02 '21 at 15:25
  • this is very helpful - thank you One issue- I can't get 'spline' method to work - I had to use 'index' otherwise the code threw an error. @daviderickson – lrobertd Jan 02 '21 at 23:47