
I have a very annoying set of measurements, and every so often it just blips and has a very high value followed by a very low one compared to the normal pattern.

I'm trying to get rid of these, and was wondering if there is a way to efficiently subtract the previous point and, if the difference is greater than some threshold (say 20, for argument's sake), delete both points:

This is what I have so far, but I have a huge data set so I want to avoid for loops:

for i in range(1, len(df1)):
    if abs(df1[TYPE].loc[i] - df1[TYPE].loc[i-1]) > 20:
        # update new columns
        df1[TYPE].loc[i-1] = -9999
        df1[TYPE].loc[i] = -9999

where TYPE is the name of my column variable. This method also gives:

TypeError: cannot do index indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [1] of <class 'int'>
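From what I can tell, the error is because `.loc` on a DatetimeIndex expects timestamps rather than integer positions, so positional access would need `.iloc`; a minimal sketch of what I mean (still a loop, so not what I ultimately want):

col = df1.columns.get_loc(TYPE)  # column position, so assignment hits the frame itself
for i in range(1, len(df1)):
    if abs(df1[TYPE].iloc[i] - df1[TYPE].iloc[i-1]) > 20:
        df1.iloc[i-1, col] = -9999  # flag the point before the jump
        df1.iloc[i, col] = -9999    # and the jump itself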

I was also thinking something along these lines was an option, but I'm not sure how to adapt it so it wouldn't do the whole dataframe in one go but instead compare each location:

df1.insert(0, "newcol1", "-9999")
df1['newcol1'] = df1[TYPE].shift() == df1[TYPE]
df1.ix[0, 'newcol1'] = True

if df1[['newcol1']] > 50:
    print("Jhrjhfh!")
    df1[TYPE] = -9999
    df1[TYPE].shift = -9999
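For what it's worth, the kind of per-location comparison I'm imagining might look roughly like this (an untested sketch, assuming the threshold of 20 from above and a pandas version recent enough for `fill_value`):

jump = df1[TYPE].diff().abs() > 20             # True where the value jumps from the previous row
bad = jump | jump.shift(-1, fill_value=False)  # also flag the row before each jump
df1.loc[bad, TYPE] = -9999                     # mirrors the loop above, without iterating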
SLE
  • presumably you could calculate the variance or standard deviation and just remove the rows that are outliers? – EdChum Jul 20 '16 at 14:24
  • Nope, I can't use standard deviation. The reason is that there are peaks in the data that are artifacts of the meteorology (i.e. high wind speeds affect the data, but that's science), and then there are cases like this where it is the instrument, hence why I am trying to get it to look at the next point along – SLE Jul 20 '16 at 14:26
  • Is the behaviour consistent though? You say very high followed by very low? You can use `diff` to detect this and see if there is a reliable pattern to use to remove the erroneous rows – EdChum Jul 20 '16 at 14:27
  • So using standard deviation would take out all points above a threshold, including any gradual changes caused by the natural differences you get from one month/year to another – SLE Jul 20 '16 at 14:27
  • And nope, sadly not consistent at all. – SLE Jul 20 '16 at 14:29
  • You should, though, be able to use `diff`, maybe with `cumsum()`, to detect these peaks and see if they are separable from the real data – EdChum Jul 20 '16 at 14:30
  • Would the diff function work like this then? `a = df1.loc[df1[TYPE].diff(periods=1, axis=1) > 20, df1[TYPE]] = -9999` – SLE Jul 20 '16 at 14:35
  • Oh, it can't seem to deal with NaN minus NaN, as I have NaNs in my data – SLE Jul 20 '16 at 14:36
  • you should filter those first using `dropna` (see the snippet after these comments) – EdChum Jul 20 '16 at 14:38
  • Generally speaking, pandas will act on the whole DataFrame. The most elegant solution, if you are okay with acting on the whole DataFrame, is the shifting shown here (http://stackoverflow.com/questions/30673209/pandas-compare-next-row). If you can't do the whole DataFrame, it is not the fastest way, but you could always iterate over each row in a for loop, build a list for the column, and compare with the value at i-1. Then, if it exceeds the threshold, you could print. – PhysicalChemist Jul 20 '16 at 16:40
  • Oh that's awesome, many thanks! – SLE Jul 21 '16 at 12:45
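A minimal sketch of the NaN filtering suggested in the comments, assuming the NaNs only appear in the measurement column:

df1 = df1.dropna(subset=[TYPE])  # remove NaN rows first
jumps = df1[TYPE].diff().abs()   # now every difference is well defined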

1 Answer


You could apply this method to each column that you want to filter. This filters out blips that last only a single row.

max_delta = 50
# absolute difference between each value and the one after it
delta = df1['colname'].diff().shift(-1).abs()
# the last row has no successor, so reuse the previous delta
delta.iloc[-1] = delta.iloc[-2]
# keep only rows whose jump to the next value stays below the threshold
df1 = df1[delta < max_delta]
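For example, applying it to several columns in turn might look like this (a sketch; the column names are placeholders for whatever you need to clean):

for col in ['wind_speed', 'temperature']:  # hypothetical column names
    delta = df1[col].diff().shift(-1).abs()
    delta.iloc[-1] = delta.iloc[-2]
    df1 = df1[delta < max_delta]  # drop rows that blip in this column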
Jonathan Koren