
I have a very annoying set of measurements, and every so often it just blips and has a very high value followed by a very low one compared to the normal pattern.

I'm trying to get rid of these, and was wondering if there is a way to efficiently subtract the previous point and, if the difference is greater than some threshold (say 20, for argument's sake), delete both points:

This is what I have so far, but I have a huge data set so I want to avoid for loops:

for i in range(1, len(df1)):
    if abs(df1[TYPE].loc[i] - df1[TYPE].loc[i-1]) > 20:
        # update new columns
        df1[TYPE].loc[i-1] = -9999
        df1[TYPE].loc[i] = -9999

where TYPE is the name of my column variable. This method also gives:

TypeError: cannot do index indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [1] of <class 'int'>
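From what I can tell, the error is because `.loc` on a DatetimeIndex expects timestamps rather than integer positions, so positional access would need `.iloc`; a minimal sketch of what I mean (still a loop, so not what I ultimately want):

col = df1.columns.get_loc(TYPE)  # column position, so assignment hits the frame itself
for i in range(1, len(df1)):
    if abs(df1[TYPE].iloc[i] - df1[TYPE].iloc[i-1]) > 20:
        df1.iloc[i-1, col] = -9999  # flag the point before the jump
        df1.iloc[i, col] = -9999    # and the jump itself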

I was also thinking something along these lines was an option, but I'm not sure how to adapt it so it wouldn't do the whole dataframe in one go but instead compare each location:

df1.insert(0, "newcol1", "-9999")
df1['newcol1'] = df1[TYPE].shift() == df1[TYPE]
df1.ix[0, 'newcol1'] = True

if df1[['newcol1']] > 50:
    print("Jhrjhfh!")
    df1[TYPE] = -9999
    df1[TYPE].shift = -9999
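For what it's worth, the kind of per-location comparison I'm imagining might look roughly like this (an untested sketch, assuming the threshold of 20 from above and a pandas version recent enough for `fill_value`):

jump = df1[TYPE].diff().abs() > 20             # True where the value jumps from the previous row
bad = jump | jump.shift(-1, fill_value=False)  # also flag the row before each jump
df1.loc[bad, TYPE] = -9999                     # mirrors the loop above, without iterating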
SLE
  • presumably you could calculate the variance or standard deviation and just remove the rows that are outliers? – EdChum Jul 20 '16 at 14:24
  • Nope, I can't use standard deviation. The reason is that there are peaks in the data that are artifacts of the meteorology (i.e. high wind speeds affect the data, but that's science), and then there are cases like this where it is the instrument, hence why I am trying to get it to look at the next point along – SLE Jul 20 '16 at 14:26
  • Is the behaviour consistent though? You say very high followed by very low? You can use `diff` to detect this and see if there is a reliable pattern to use to remove the erroneous rows – EdChum Jul 20 '16 at 14:27
  • So using standard deviation would take out all points above a threshold, including any gradual changes caused by the natural differences you get from one month/year to another – SLE Jul 20 '16 at 14:27
  • And nope, sadly not consistent at all. – SLE Jul 20 '16 at 14:29
  • You should, though, be able to use `diff`, maybe with `cumsum()`, to detect these peaks and see if they are separable from the real data – EdChum Jul 20 '16 at 14:30
  • Would the diff function work like this then? `a = df1.loc[df1[TYPE].diff(periods=1, axis=1) > 20, df1[TYPE]] = -9999` – SLE Jul 20 '16 at 14:35
  • Oh, it can't seem to deal with NaN minus NaN, as I have NaNs in my data – SLE Jul 20 '16 at 14:36
  • you should filter those first using `dropna` (see the snippet after these comments) – EdChum Jul 20 '16 at 14:38
  • Generally speaking, pandas will act on the whole DataFrame. The most elegant solution, if you are okay with acting on the whole DataFrame, is the shifting shown here (http://stackoverflow.com/questions/30673209/pandas-compare-next-row). If you can't do the whole DataFrame, it is not the fastest way, but you could always iterate over each row in a for loop, build a list for the column, and compare with the value at i-1. Then, if it exceeds the threshold, you could print. – PhysicalChemist Jul 20 '16 at 16:40
  • Oh that's awesome, many thanks! – SLE Jul 21 '16 at 12:45
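A minimal sketch of the NaN filtering suggested in the comments, assuming the NaNs only appear in the measurement column:

df1 = df1.dropna(subset=[TYPE])  # remove NaN rows first
jumps = df1[TYPE].diff().abs()   # now every difference is well defined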

1 Answer


You could apply this method to each column that you want to filter. This filters out blips that last only a single row.

max_delta = 50
# absolute difference between each value and the one after it
delta = df1['colname'].diff().shift(-1).abs()
# the last row has no successor, so reuse the previous delta
delta.iloc[-1] = delta.iloc[-2]
# keep only rows whose jump to the next value stays below the threshold
df1 = df1[delta < max_delta]
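For example, applying it to several columns in turn might look like this (a sketch; the column names are placeholders for whatever you need to clean):

for col in ['wind_speed', 'temperature']:  # hypothetical column names
    delta = df1[col].diff().shift(-1).abs()
    delta.iloc[-1] = delta.iloc[-2]
    df1 = df1[delta < max_delta]  # drop rows that blip in this column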
Jonathan Koren