I have some data that I fit a linear function to, and the fit feeds into other work (not important here). I'm using numpy.polyfit, and when I pass only the data and the degree of the fit, nothing else, it produces this plot:

[Plot: the data with the unweighted numpy.polyfit line]

Now, the fit is okay, but the general consensus is that the line of best fit is being skewed by the red data points above it, and that I should instead be fitting to the data just below, which forms a nice linear shape (beginning around that congested blob of blue points). So I added a weighting to my call to polyfit, choosing an arbitrary weight of 1/sqrt(y-values), so that points with smaller y-values are weighted more heavily. This gave the following:

[Plot: the same data with the weighted polyfit line]

This is admittedly better, but I'm still unsatisfied, as the line now appears to be too low. Ideally I would like a middle ground. Since my weighting was really arbitrary, I was wondering whether there is a general way to perform a more robust fit in Python, and whether this can be done with polyfit. Using a separate package is fine too, if it works.

Chris Martin
  • Yes, Python has many advanced packages for statistics. But this is more of a statistics question than a programming question. Look up `Classification for outlier removal`, `clustering`, `k-nearest neighbor`, `RANSAC`, `robust regression`. In the end, understanding your experiment and possibly finding reasons to exclude certain data is typically the best first-order approach. – roadrunner66 Mar 07 '16 at 04:03
  • Thanks a lot @roadrunner66! – Joshua D'Agostino Mar 07 '16 at 04:38

2 Answers

This question has less to do with programming or Python than with statistics or linear algebra.

You could compare the fitting error of a best-fit line with that of a best-fit quadratic and see which is smaller. But a lot of it depends on context.
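As a rough sketch of that comparison, the snippet below fits a degree-1 and a degree-2 polynomial with numpy.polyfit and compares their residual sums of squares. The data is synthetic and the helper name `rss` is made up for illustration; neither comes from the question.

```python
import numpy as np

# Synthetic data, purely illustrative (not the asker's data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)


def rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))


rss_line = rss(x, y, 1)
rss_quad = rss(x, y, 2)
print(f"line RSS: {rss_line:.3f}, quadratic RSS: {rss_quad:.3f}")
```

Because a line is a special case of a quadratic, the quadratic's in-sample error can never be larger, so the raw number alone will not tell you which model is appropriate.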

If you have 500 data points, you could find a polynomial of degree 499 that models your dataset with zero error (assuming distinct x-values), but it would tell you nothing useful about the data. Likewise, if you weight your data points, the weighting needs to make sense for the data.
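A small illustration of the zero-error point, using 6 points rather than 500 to keep the fit numerically well behaved: n points with distinct x-values can be matched exactly by a polynomial of degree n-1, even when the data has no trend at all. The data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(6, dtype=float)          # 6 distinct x-values
y = rng.normal(size=x.size)            # arbitrary values with no real trend

coeffs = np.polyfit(x, y, x.size - 1)  # degree-5 polynomial through 6 points
max_error = float(np.max(np.abs(y - np.polyval(coeffs, x))))
print(max_error)                       # essentially zero, up to rounding error
```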

If you want your best-fit line to "look right", then skip the preamble and just draw it where you want it. If you want it to make sense, ask a mathematician for a formula that suits the data, then follow it.

russloewe

statsmodels has robust linear estimators (RLM) with various weight functions that should work well in cases like this.

http://www.statsmodels.org/dev/generated/statsmodels.robust.robust_linear_model.RLM.html

http://www.statsmodels.org/dev/examples/index.html#robust

These are M-estimators: they are robust to "y outliers", but not to "x outliers", i.e. influential points with outlying regressor values.

Josef