I am trying to extrapolate my dataset. A snippet looks as follows. A simple linear extrapolation is fine here:

Index Value
3000  NaN
4000  NaN
5000  10
6000  20
6500  33
7000  44  
8300  60
9300  NaN
9400  NaN

The extrapolation should take the index values into account, since they are not evenly spaced. As far as I can tell, pandas only provides a function for interpolation, so I am stuck. I looked at the scipy package, but can't seem to implement my idea. I would really appreciate any help.

1 Answer

I'm more familiar with scikit-learn:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([(3000, np.nan),
                   (4000, np.nan),
                   (5000, 10),
                   (6000, 20),
                   (6500, 33),
                   (7000, 44),
                   (8300, 60),
                   (9300, np.nan),
                   (9400, np.nan)], columns=['Index', 'Value'])

def extrapolate(df, X_col, y_col):
    # Fit only on the rows where both columns are known
    df_ = df[[X_col, y_col]].dropna()

    # scikit-learn expects a 2D feature array, hence reshape(-1, 1)
    model = LinearRegression().fit(
        df_[X_col].values.reshape(-1, 1), df_[y_col])

    # Predict for every row, including those where y is NaN
    return model.predict(df[X_col].values.reshape(-1, 1))

df['Value_'] = extrapolate(df, 'Index', 'Value')
df

You should obtain something like this:

    Index   Value   Value_
0   3000    NaN     -23.219022
1   4000    NaN     -7.314802
2   5000    10.0    8.589417
3   6000    20.0    24.493637
4   6500    33.0    32.445747
5   7000    44.0    40.397857
6   8300    60.0    61.073342
7   9300    NaN     76.977562
8   9400    NaN     78.567984

# I assume you don't want to overwrite the original values,
# so only the NaN entries are filled with the fitted ones
df['Value'] = df['Value'].fillna(df['Value_'])
df

Gives:

    Index   Value   Value_
0   3000    -23.219022  -23.219022
1   4000    -7.314802   -7.314802
2   5000    10.000000   8.589417
3   6000    20.000000   24.493637
4   6500    33.000000   32.445747
5   7000    44.000000   40.397857
6   8300    60.000000   61.073342
7   9300    76.977562   76.977562
8   9400    78.567984   78.567984
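
If you would rather not pull in scikit-learn, the same least-squares line can probably be obtained with numpy alone. This is only a minimal sketch, assuming np.polyfit with deg=1 is an acceptable stand-in for LinearRegression on a single feature; the column name Value_filled is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Index": [3000, 4000, 5000, 6000, 6500, 7000, 8300, 9300, 9400],
    "Value": [np.nan, np.nan, 10, 20, 33, 44, 60, np.nan, np.nan],
})

# Fit a straight line using only the rows where Value is known
known = df.dropna(subset=["Value"])
slope, intercept = np.polyfit(
    known["Index"].to_numpy(dtype=float),
    known["Value"].to_numpy(dtype=float),
    deg=1,
)

# Evaluate the line at every index position
fitted = slope * df["Index"] + intercept

# Keep the original values and fill only the NaNs
# ("Value_filled" is a hypothetical column name)
df["Value_filled"] = df["Value"].fillna(fitted)
print(df)
```

As with the scikit-learn version, this fits one line through all known points rather than continuing the slope of the outermost segment, so the filled values at the edges will not join up exactly with the neighbouring observations.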
dokteurwho