Getting rid of spikes in sample data

Question

How can I get rid of spikes in a discrete data set, but in a "smoothed-out" manner?

Take for instance

[image: plot of the sample data]

There are two spikes at 20000, but the next one, at 600, is also considered a spike.

I've managed to bring the very high ones down to zero with:

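# insertions: the list of sample values (not shown here)

# Shape parameters of the beta distribution
a = 2
b = 5
beta_dist = RealDistribution('beta', [a, b])

# Scale the data into [0, 1], the support of the beta distribution
# (19968 appears to be the largest value in insertions)
f(x) = x / 19968
normalized_insertions = [f(i) for i in insertions]

# Pair each normalized value with the beta density at that point
insertions_pairs = [(i, beta_dist.distribution_function(i)) for i in normalized_insertions]
plot_b = beta_dist.plot()

# Show the points together with the density curve
show(list_plot(insertions_pairs) + plot_b)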

I have no idea how to handle the lower ones. The maximum should be reached at 100; perhaps the parameters of the beta distribution need a little more tweaking?

Currently, it looks like this: [image: plot of the result]

If possible, use Sage as a reference for your explanations.

Are you looking for a way to perform data-smoothing? If so, then applying a median filter as Paul R suggests will do the trick. Also, what exactly are you trying to measure with this data, and why did you choose to use a beta distribution? — xvtk, Sep 14 '12 at 14:29
@PaulR I would be glad to accept your answer, if you posted it as such. — Flavius, Sep 14 '12 at 16:56

score 2 · Accepted Answer · answered Sep 15 '12 at 05:51

You could use a median filter with a window of perhaps 3 or 5 points. This would remove isolated outliers like the ones in your data above.
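
As a rough illustration, here is a minimal sketch of such a median filter in plain Python, which should also run inside Sage. The function name `median_filter`, the edge handling (repeating the first and last sample), and the sample values are assumptions made for this example, not taken from the question's data.

```python
from statistics import median

def median_filter(values, window=3):
    """Replace each sample with the median of a centred window around it."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        # Clamp indices so the first/last sample is repeated at the edges
        # (an assumed edge policy; other choices are possible).
        neighbours = [values[min(max(j, 0), n - 1)]
                      for j in range(i - half, i + half + 1)]
        smoothed.append(median(neighbours))
    return smoothed

# Made-up data shaped like the question's: values near 100, isolated spikes.
insertions = [90, 100, 20000, 95, 105, 600, 98, 102, 20000, 97, 101]
print(median_filter(insertions, window=3))
# prints [90, 100, 100, 105, 105, 105, 102, 102, 102, 101, 101]
```

A 3-point window only removes spikes that are one sample wide; if two spikes can sit next to each other, a 5-point window is the safer choice. If SciPy is available, `scipy.signal.medfilt` offers a ready-made version of the same idea.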

Paul R

score 1 · Answer 2 · answered Oct 22 '12 at 10:37

You should perhaps look at a Kalman filter. It treats the noise in your data as Gaussian and smooths toward a running estimate of the mean. With the measurement noise set high enough, the 20k values will be strongly damped, and the 600s will likewise be heavily outweighed by the consistency of the rest of your data. If you like math:
http://www.cs.berkeley.edu/~pabbeel/cs287-fa11/slides/Smoother_KalmanSmoother--DRAFT.pdf
Otherwise maybe:
http://interactive-matter.eu/blog/2009/12/18/filtering-sensor-data-with-a-kalman-filter/
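
To make the idea concrete, here is a minimal sketch of a scalar Kalman filter in plain Python, assuming the underlying level behaves like a slow random walk. The function name `kalman_smooth`, the parameter values, and the sample data are all assumptions chosen for illustration and would need tuning for real data. Note that a plain Kalman filter is linear: it scales a spike down by the gain rather than discarding it, so some residue of each spike remains in the estimate.

```python
def kalman_smooth(values, process_var=1.0, measurement_var=1.0e4,
                  initial_var=1.0):
    """Scalar Kalman filter for a slowly drifting level (random-walk model).

    process_var:     assumed variance of the level's drift per step
    measurement_var: assumed variance of the measurement noise
    initial_var:     assumed uncertainty of the first sample
    """
    if not values:
        return []
    estimate = float(values[0])  # start from the first sample
    error_var = initial_var
    smoothed = [estimate]
    for z in values[1:]:
        # Predict: the level is assumed unchanged, uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement according to the gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed

# Made-up data shaped like the question's: values near 100, isolated spikes.
insertions = [90, 100, 20000, 95, 105, 600, 98, 102, 20000, 97, 101]
smoothed = kalman_smooth(insertions)
print([round(v, 1) for v in smoothed])
```

The ratio of `process_var` to `measurement_var` controls the trade-off: a smaller ratio suppresses spikes more strongly but also makes the estimate slower to follow genuine changes. Because the filter starts from the first sample, a spike in that position would need separate handling.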
