3

I have to develop a data snooping function in which I divide my dataset, which is in a pandas dataframe, in smaller parts and interpolate there a cubic spline (which are the Z- Values in my case). For this I used the interpolate.interp2d function from scipy Then I subtract my given height values from the estimated ones to get the residuals, where I can apply a 3 sigma threshold and I drop the row with the highest outlier.

For a better understanding here is a 3D plot of my dataset with outlier: enter image description here

But when I run my code I get the following warnings:

/home/mattes/anaconda3/lib/python3.6/site-packages/scipy/interpolate
/_fitpack_impl.py:976: RuntimeWarning: The maximal number of iterations 
(20) allowed for finding smoothing
spline with fp=s has been reached. Probable cause: s too small.
(abs(fp-s)/s>0.001)
    kx,ky=3,3 nx,ny=18,21 m=915 fp=221.198171 s=0.000000
 warnings.warn(RuntimeWarning(_iermess2[ierm][0] + _mess))

/home/mattes/anaconda3/lib/python3.6/site-packages/scipy/interpolate
/_fitpack_impl.py:976: RuntimeWarning: A theoretically impossible result 
when finding a smoothing spline
with fp = s. Probable causes: s too small or badly chosen eps.
(abs(fp-s)/s>0.001)
    kx,ky=3,3 nx,ny=19,22 m=914 fp=209.480429 s=0.000000
warnings.warn(RuntimeWarning(_iermess2[ierm][0] + _mess))

/home/mattes/anaconda3/lib/python3.6/site-packages/scipy/interpolate
/_fitpack_impl.py:976: RuntimeWarning: No more knots can be added because 
the number of B-spline
coefficients already exceeds the number of data points m.
Probable causes: either s or m too small. (fp>s)
    kx,ky=3,3 nx,ny=26,46 m=911 fp=158.754387 s=0.000000
warnings.warn(RuntimeWarning(_iermess2[ierm][0] + _mess))

This is the code I created:

def DataSnooping_splines(df, sigma_size, step_size):


s = sigma_size
Orbit_No = df.index
df = df.assign(orbitNo = Orbit_No)
df = df.sort_values(by=['chainage_km'])
df.index = pd.RangeIndex(len(df.index))

start_pos = df['chainage_km'].values[0]
end_pos = int(df['chainage_km'].values[len(df['chainage_km'].values)-1])
start_points = np.arange(start_pos, end_pos, step_size)

for i in start_points:

    dfn = df.sort_values(by=['chainage_km'])
    dfn = dfn[(dfn['chainage_km']>= i) & (dfn['chainage_km']<=i+step_size)]
   # dfn.index = pd.RangeIndex(len(dfn.index))

    # extract the needed values from the dataframe:
    time = dfn['daytime'].values
    chainage = dfn['chainage_km'].values
    height = dfn['Surf1MinusEGM08'].values

    # calculate the parameters for the cubic spline:
    f = interpolate.interp2d(time, chainage, height, kind='cubic')

    # estimate the spline:
    height_spline = []

    for dt in range(len(time)):
        height_spline.append(f(time[dt], chainage[dt]))
        pass
    height_splines = np.asarray(height_spline)

    residuals = []
    # calculate the residuals:
    for d in range(len(height)):
        residuals.append(height[d] - height_splines[d])
        pass

    print('length residuals ', len(residuals))
    print('length df ', len(dfn))

    # save the residuals in the dataframe:
    dfn = dfn.assign(residual=residuals)

    # calculate the standard deviation
    std_res = np.std(dfn['residual'])

    # count the number of outlayer
    number_bigger_3sigma = len(dfn[np.abs(dfn['residual'])>s*std_res])

    # Find elements bigger than 3sigma and eliminate them iteratively
    while number_bigger_3sigma > 0:
        # eliminate biggest element
        # save all values higher 3 sigma in dfn_highersigme
        dfn_highersigma = dfn[np.abs(dfn['residual'])>s*std_res]

        # Extract the highest value
        max_val = np.max(np.abs(dfn_highersigma['residual']))
        dfn_drop = dfn_highersigma[np.abs(dfn_highersigma['residual']) == max_val[0]]
        try:
            dfn_drop_s
            check_i = 0
        except:
            dfn_drop_s = dfn_drop
            check_i = 1

        if check_i == 0:
            dfn_drop_s = dfn_drop_s.append(dfn_drop)
            pass

        # Eliminate the highest value
        dfn = dfn.drop(dfn_drop.index, axis=0)

        # extract the needed values from the dataframe:
        time = dfn['daytime'].values
        chainage = dfn['chainage_km'].values
        height = dfn['Surf1MinusEGM08'].values

        # Use the hight of the remaining values to calculate the polygon again
        f = interpolate.interp2d(time, chainage, height, kind='cubic')

        # estimate the spline:
        height_spline = []

        for dt in range(len(time)):
            height_spline.append(f(time[dt], chainage[dt]))
            pass

        height_splines = np.asarray(height_spline)

        residuals = []
        # calculate the residuals:
        for d in range(len(height)):
            residuals.append(height[d] - height_splines[d])
            pass

        # save the residuals in the dataframe:
        dfn = dfn.assign(residual=residuals)

        # calculate the standard deviation
        std_res = np.std(dfn['residual'])

        # calculate new number of elements bigger than 3sigma
        number_bigger_3sigma = len(dfn[np.abs(dfn['residual'])>s*std_res])

    height_spline = np.asarray(height_spline)
    z_new = []
    for dd in range(len(dfn['residual'])):
        z_new.append(dfn['residual'].values[dd] + height_spline[dd])
        pass
    dfn = dfn.assign(Surf1MinusEGM08_new = z_new)

    try:
        dfn_new
        count_i = 0
    except NameError:
         dfn_new = dfn     
         count_i = 1

    if count_i == 0:
        dfn_new = dfn_new.append(dfn) 
        pass


dfn_new = dfn_new.set_index('orbitNo')
return dfn_new, dfn_drop_s

I used for sigme_size 3 and step_size = 50 that there are enough data points for the spline calculation. Also beside there are datapoints eliminated, the outlier are still present after using the function.

Does someone has an idea how to solve this problem?

Thank you very much!

Dennis
  • 171
  • 3
  • 16

0 Answers0