
I have a question concerning imputation for panel data. In short, I want to impute values in years that have missing data based on the other years of the relevant entity. I thus do not want to impute anything for an entity that has no non-missing data at all. The example here uses interpolation, but I would like to know how to do this with averages as well. The interpolation method that I use seems to impute data for all cases, which makes me doubt that it is done per entity. See the example below:

import numpy as np
import pandas as pd

company = ['A'] * 5 + ['B'] * 5 + ['C'] * 5
year = ['2017', '2018', '2019', '2020', '2021'] * 3
value = [.3, .32, .33, .35, .37, .2, .21, .22, np.nan, np.nan] + [np.nan] * 5
df = pd.DataFrame(data={'company': company, 'year': year, 'value': value})

print(df)

   company  year  value
0        A  2017   0.30
1        A  2018   0.32
2        A  2019   0.33
3        A  2020   0.35
4        A  2021   0.37
5        B  2017   0.20
6        B  2018   0.21
7        B  2019   0.22
8        B  2020    NaN
9        B  2021    NaN
10       C  2017    NaN
11       C  2018    NaN
12       C  2019    NaN
13       C  2020    NaN
14       C  2021    NaN

df = df.set_index(['company', 'year'])

df['value2'] = df['value'].interpolate()

print(df)

              value  value2
company year
A       2017   0.30    0.30
        2018   0.32    0.32
        2019   0.33    0.33
        2020   0.35    0.35
        2021   0.37    0.37
B       2017   0.20    0.20
        2018   0.21    0.21
        2019   0.22    0.22
        2020    NaN    0.22
        2021    NaN    0.22
C       2017    NaN    0.22
        2018    NaN    0.22
        2019    NaN    0.22
        2020    NaN    0.22
        2021    NaN    0.22

Instead I would like something like this:

              value  value2
company year
A       2017   0.30    0.30
        2018   0.32    0.32
        2019   0.33    0.33
        2020   0.35    0.35
        2021   0.37    0.37
B       2017   0.20    0.20
        2018   0.21    0.21
        2019   0.22    0.22
        2020    NaN    0.23
        2021    NaN    0.24
C       2017    NaN    NaN
        2018    NaN    NaN
        2019    NaN    NaN
        2020    NaN    NaN
        2021    NaN    NaN

1 Answer


You can interpolate per group:

df['value2'] = df.groupby('company')['value'].transform(lambda g: g.interpolate())

Output:

   company  year  value  value2
0        A  2017   0.30    0.30
1        A  2018   0.32    0.32
2        A  2019   0.33    0.33
3        A  2020   0.35    0.35
4        A  2021   0.37    0.37
5        B  2017   0.20    0.20
6        B  2018   0.21    0.21
7        B  2019   0.22    0.22
8        B  2020    NaN    0.22
9        B  2021    NaN    0.22
10       C  2017    NaN     NaN
11       C  2018    NaN     NaN
12       C  2019    NaN     NaN
13       C  2020    NaN     NaN
14       C  2021    NaN     NaN
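
The question also asks how to do this with averages; the same per-group pattern applies. A minimal sketch (the column name value_mean is only illustrative):

# Fill each company's gaps with that company's own mean; a company
# with no data at all (like C) keeps NaN, because the mean of an
# all-NaN group is itself NaN.
df['value_mean'] = df.groupby('company')['value'].transform(lambda g: g.fillna(g.mean()))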

What you want, however, looks more like extrapolation. You can use the successive differences for that:

df['value2'] = df.groupby('company')['value'].transform(lambda g: g.fillna(g.diff().interpolate().cumsum().add(g.iloc[0])))

Output:

   company  year  value  value2
0        A  2017   0.30    0.30
1        A  2018   0.32    0.32
2        A  2019   0.33    0.33
3        A  2020   0.35    0.35
4        A  2021   0.37    0.37
5        B  2017   0.20    0.20
6        B  2018   0.21    0.21
7        B  2019   0.22    0.22
8        B  2020    NaN    0.23
9        B  2021    NaN    0.24
10       C  2017    NaN     NaN
11       C  2018    NaN     NaN
12       C  2019    NaN     NaN
13       C  2020    NaN     NaN
14       C  2021    NaN     NaN
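
To see what the one-liner does, here it is broken into steps for company B (a sketch; g and the intermediate names are only for illustration, using the flat df from before set_index):

g = df.loc[df['company'] == 'B', 'value']  # 0.20, 0.21, 0.22, NaN, NaN
d = g.diff().interpolate()                 # NaN, 0.01, 0.01, 0.01, 0.01 (the last observed difference is carried forward)
filler = d.cumsum().add(g.iloc[0])         # NaN, 0.21, 0.22, 0.23, 0.24
out = g.fillna(filler)                     # 0.20, 0.21, 0.22, 0.23, 0.24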

Update: as raised in the comments, this breaks when the non-missing values sit in the middle of the timespan (company D below). To fix it, handle each stretch of non-NA values followed by NAs separately:

m = df['value'].notna()
# increment the counter at the start of each non-NA stretch, so each
# stretch of non-NA values, plus the NAs that follow it, shares an id
group = (m & m.ne(m.shift())).cumsum()

# extrapolate within each (company, stretch) group
df['value2'] = (df.groupby(['company', group])['value']
                  .transform(lambda g: g.fillna(g.diff().interpolate().cumsum().add(g.iloc[0])))
               )

Output:

   company  year  value  value2
0        A  2017   0.30    0.30
1        A  2018   0.32    0.32
2        A  2019   0.33    0.33
3        A  2020   0.35    0.35
4        A  2021   0.37    0.37
5        B  2017   0.20    0.20
6        B  2018   0.21    0.21
7        B  2019   0.22    0.22
8        B  2020    NaN    0.23
9        B  2021    NaN    0.24
10       C  2017    NaN     NaN
11       C  2018    NaN     NaN
12       C  2019    NaN     NaN
13       C  2020    NaN     NaN
14       C  2021    NaN     NaN
15       D  2017    NaN     NaN
16       D  2018    NaN     NaN
17       D  2019   0.40    0.40
18       D  2020   0.42    0.42
19       D  2021    NaN    0.44
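
Note that transform is used rather than groupby(...).apply: it always returns a result aligned to the original index, so it can be assigned straight back to df. In recent pandas versions (2.0+), apply prepends the group keys to the result index unless group_keys=False is passed to groupby, which would break that assignment.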
mozway
  • Thank you, this works great! I did some testing, however, and this method doesn't seem to work when the non-missing values are in the middle of the timespan: `company = ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['D'] * 5 year = ['2017', '2018', '2019', '2020', '2021'] * 4 value = [.3, .32, .33, .35, .37, .2, .21, .22, np.nan, np.nan] + [np.nan] * 5 + [np.nan, np.nan, 0.4, 0.42, np.nan] df = pd.DataFrame(data = {'company': company, 'year': year, 'value': value}) df['value2'] = df.groupby('company')['value'].apply(lambda g: g.fillna(g.diff().interpolate().cumsum().add(g.iloc[0]))) print(df)` – business_of_ferrets Oct 04 '22 at 14:51
  • In this case you should group by stretches of contiguous data – mozway Oct 04 '22 at 14:57
  • I'm very sorry, but I don't fully understand what that means – business_of_ferrets Oct 04 '22 at 15:06
  • I can't have a look right now, I'll try to update later – mozway Oct 04 '22 at 15:06