
I'm new to the regression game and hope to plot a nonlinear regression line of essentially arbitrary functional form (plus confidence and prediction intervals) for the subset of data that satisfies a certain condition (i.e. with mean replicate value exceeding a threshold; see below).

The data are generated for the independent variable x across 20 different values, x = (20-np.arange(20))**2, with rep_num = 10 replicates for each condition. The data show strong nonlinearity across x and are generated as follows:

import numpy as np

x = (20 - np.arange(20))**2   # independent variable, 20 values
rep_num = 10                  # replicates per condition

# Mean response for each of the 20 x-values.
mu = [.40, .38, .39, .35, .37, .33, .34, .28, .11, .24,
      .03, .07, .01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Noisy replicates for the first 13 conditions; the remaining rows stay at zero.
data = np.zeros((20, rep_num))
for i in range(13):
    data[i] = np.clip(np.random.normal(loc=mu[i], scale=0.1, size=rep_num), 0., 1.)

I can make a scatter plot of the data; the replicate means are shown by the red dots:

import matplotlib.pyplot as plt

# Individual replicates (open black circles) against log10(x).
plt.scatter(np.log10(np.tile(x[:, None], rep_num)), data,
            facecolors='none', edgecolors='k', alpha=0.25)
# Replicate means (red dots) and a zero baseline (dashed).
plt.plot(np.log10(x), data.mean(1), 'ro', alpha=0.8)
plt.plot(np.log10(x), np.repeat(0., 20), 'k--')
plt.xlim(-0.02, np.max(np.log10(x)) + 0.02)
plt.ylim(-0.01, 0.7)

[scatter plot: replicates as open circles, replicate means as red dots]

My goal is to plot a regression line through only those data whose replicate mean exceeds 0.02. In addition, I would like to add a 95% confidence interval (black dashed lines) around the regression, as well as a 95% prediction interval (blue dashed lines); ideally, the region between the prediction bands would also be filled with a transparent blue background.

The final plot (without the blue background inside the prediction interval) would look something like this:

[example of the desired plot: scatter with regression line, 95% confidence band (black dashed) and 95% prediction band (blue dashed)]

How would I make this? My online search turned up several quite different partial approaches using seaborn, scipy, and statsmodels, and some of those template functions did not seem to work alongside the existing matplotlib scatter plot.

neither-nor
  • Do you have any model for your regression? 'non-linear' can be a lot.. – rammelmueller Aug 31 '17 at 07:09
  • @rammelmueller No, I'm just trying to curve-fit and show the general trend of the data. So far, the best model I've tested under `scipy.optimize` is `a*np.log2(c+x)+b` (sketched just after these comments), but it still doesn't capture the saturation part well. – neither-nor Aug 31 '17 at 07:13
  • 1
    Aha.. I guess predicting a general trend is going to be though with some reasonable certainty as the variation between different random sets seems to be rather large - once a second order polynomial will do the job and sometimes some other function will score better.. with enough parameters you can do anything really.. – rammelmueller Aug 31 '17 at 07:20
  • I should specify that I mainly want to plot the general trend of the mean for the last 13 data points (red dots). Would the variation between different random sets matter for the confidence interval of the mean in this case? – neither-nor Aug 31 '17 at 07:49
  • I would guess this depends on how good the model is. – rammelmueller Aug 31 '17 at 07:53
  • Off the top of my head I would maybe try a 'sigmoid' function over all the data.. – rammelmueller Aug 31 '17 at 07:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153326/discussion-between-neither-nor-and-rammelmuller). – neither-nor Aug 31 '17 at 08:00
  • It seems you are asking for a good model to fit your data. I don't think this is on-topic on Stackoverflow. Once you have a model, the question may be how to implement it - this would be on-topic. – ImportanceOfBeingErnest Aug 31 '17 at 10:54
  • I'm not fussy about the model used for fitting. The main goal is to plot something similar to the above picture, showing the general trend of the data in the right hand portion. – neither-nor Aug 31 '17 at 17:15
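For reference, here is a minimal sketch of the kind of curve fit mentioned in the comments above, assuming scipy is available and using the x and data arrays defined earlier; the starting values and the bound that keeps c + x positive are my own guesses:

from scipy import optimize

def log_model(x, a, b, c):
    # Model discussed in the comments: a*log2(c + x) + b.
    return a*np.log2(c + x) + b

# Fit the replicate means against x; the lower bound on c keeps c + x > 0.
p_log, _ = optimize.curve_fit(log_model, x, data.mean(1), p0=[0.1, -0.5, 1.0],
                              bounds=([-np.inf, -np.inf, 0.], np.inf))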

1 Answer


OK, here's a shot at this (without the prediction band, though). First of all you want to select the applicable data:

threshold = 0.02
mean_y = data.mean(1)
reg_x = np.log10(x)[mean_y > threshold]
reg_y = mean_y[mean_y > threshold]

Then you choose a model and perform a fit. Note that I chose a second-order polynomial here, but in principle you could use anything. For the fit I use kapteyn, which has a built-in confidence-band method, although it would also be straightforward to implement by hand (see e.g. the delta method).

from kapteyn import kmpfit

# Set model to fit.
def model(p, x):
    a, b, c = p
    return a + b*x + c*x**2

# Perform fit.
f = kmpfit.simplefit(model, [.1, .1, .1], reg_x, reg_y)

f contains the estimated parameters and related fit information, which you can use for plotting and further diagnostics.
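For instance, a quick way to inspect the result (the attribute names below are the ones documented for kmpfit's Fitter object, as far as I recall):

print("best-fit parameters:", f.params)
print("parameter standard errors:", f.stderr)
print("reduced chi-squared:", f.rchi2_min)

The fitted curve itself can then be drawn over a dense grid: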

x_grid = np.linspace(0, 3, 100)   # grid covering the log10(x) range
plt.plot(x_grid, model(f.params, x_grid), linestyle='-', color='black', marker='')

For the confidence bands, we need the partial derivatives of the model with respect to the parameters (yes, some math). Again, this is easy for a polynomial model and shouldn't be a problem for any other model either.

# Partial derivatives:
dfdp = [1., reg_x, reg_x**2]
_, ci_upper, ci_lower = f.confidence_band(reg_x, dfdp, 0.95, model)

# Plot.
plt.plot(reg_x, ci_upper, linestyle='--', color='black', marker='')
plt.plot(reg_x, ci_lower, linestyle='--', color='black', marker='')
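If kapteyn is not available, the delta method mentioned above can also be done by hand. Below is a minimal sketch using scipy.optimize.curve_fit for the same quadratic model; the names poly2, var_fit and t_val are mine, and reg_x and reg_y are assumed from above:

from scipy import optimize, stats

def poly2(x, a, b, c):
    return a + b*x + c*x**2

# Least-squares fit of the quadratic model.
popt, pcov = optimize.curve_fit(poly2, reg_x, reg_y, p0=[.1, .1, .1])

# Delta method: the variance of the fitted curve at each point is g' C g,
# with g the gradient of the model w.r.t. the parameters and C the
# parameter covariance matrix.
grad = np.column_stack([np.ones_like(reg_x), reg_x, reg_x**2])
var_fit = np.einsum('ij,jk,ik->i', grad, pcov, grad)
t_val = stats.t.ppf(0.975, len(reg_x) - len(popt))   # two-sided 95%

y_fit = poly2(reg_x, *popt)
plt.plot(reg_x, y_fit + t_val*np.sqrt(var_fit), 'k--')
plt.plot(reg_x, y_fit - t_val*np.sqrt(var_fit), 'k--')

For an unweighted fit this should give essentially the same band as the confidence_band call above.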

Unfortunately there is no prediction_band() routine in the package, at least not that I know of. Assuming you find (or write) some method for the prediction band, the plotting and preparation would look much the same, though:

p_upper, p_lower = prediction_band(*args, **kwargs)
plt.fill_between(reg_x, p_upper, p_lower, facecolor='blue', alpha=0.2, linestyle='')
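One possibility, under the usual least-squares assumptions, is to widen the delta-method confidence band by the estimated residual variance. Here is a rough sketch of such a prediction_band(), reusing poly2, popt and pcov from the scipy sketch above; the signature and all helper names are my own:

from scipy import stats

def prediction_band(x_band, func, popt, pcov, x_obs, y_obs, confprob=0.95):
    # Residual variance estimated from the observed points.
    resid = y_obs - func(x_obs, *popt)
    s2 = np.sum(resid**2) / (len(x_obs) - len(popt))
    # Gradient of the quadratic model w.r.t. its parameters, evaluated at x_band.
    grad = np.column_stack([np.ones_like(x_band), x_band, x_band**2])
    # Prediction variance = confidence-band variance + residual variance.
    var_pred = np.einsum('ij,jk,ik->i', grad, pcov, grad) + s2
    t_val = stats.t.ppf(0.5 + confprob/2., len(x_obs) - len(popt))
    y_band = func(x_band, *popt)
    return y_band + t_val*np.sqrt(var_pred), y_band - t_val*np.sqrt(var_pred)

p_upper, p_lower = prediction_band(reg_x, poly2, popt, pcov, reg_x, reg_y)
plt.fill_between(reg_x, p_upper, p_lower, facecolor='blue', alpha=0.2, linewidth=0)
plt.plot(reg_x, p_upper, 'b--')
plt.plot(reg_x, p_lower, 'b--')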

Hope this helps, L.

rammelmueller