7

Using a penalized spline in mgcv, I want to obtain an effective degrees of freedom (EDF) of 10 per year for the time smooth in the example data (60 over the entire six-year period).

library(mgcv)
library(dlnm) 
df <- chicagoNMMAPS

df1 <- subset(df, as.Date(date) >= '1995-01-01')

mod1 <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) + s(temp, k = 6, bs = 'cr') + as.factor(dow),
            family = quasipoisson, na.action = na.omit, data = df1)

In the example data the penalized fit gives an EDF for the time smooth of 56.117, which is less than 10 per year.

summary(mod1)


Approximate significance of smooth terms:
           edf Ref.df     F p-value    
s(time) 56.117 67.187 5.369  <2e-16 ***
s(temp)  2.564  3.204 0.998   0.393    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.277   Deviance explained = 28.2%
GCV score = 1.1297  Scale est. = 1.0959    n = 2192

Manually, I will change the EDF by supplying the smoothing parameters myself, as follows:

mod1$sp

 s(time)  s(temp) 
23.84809 17.23785 

Then I will plug these sp values into a new model and refit it. Basically, I will keep altering the smoothing parameter for time (leaving the one for temp unchanged) until I obtain an EDF of around 60.

I will start with a lower value and check the edf:

mod1a <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) + s(temp, k = 6, bs = 'cr') + as.factor(dow),
             family = quasipoisson, na.action = na.omit, data = df1,
             sp = c(12.84809, 17.23785))
summary(mod1a)
#  edf  62.997

I have to increase the smoothing parameter for time to bring the EDF down to around 60.

mod1b <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) + s(temp, k = 6, bs = 'cr') + as.factor(dow),
             family = quasipoisson, na.action = na.omit, data = df1,
             sp = c(14.84809, 17.23785))
summary(mod1b)
#  edf  61.393  ## EDF still too large, so I have to increase the sp further

mod1c <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) + s(temp, k = 6, bs = 'cr') + as.factor(dow),
             family = quasipoisson, na.action = na.omit, data = df1,
             sp = c(16.8190989, 17.23785))
summary(mod1c)

#  edf = 60.005  ## This is what I want to obtain as the final model.

How can one achieve this final result with more efficient code?

Meso

2 Answers

5

I don't understand the details of your model, but if you are looking to hit a target edf for models fitted with different sp, optim will do the job. First, create a function that returns the distance of the edf from the target (60) for a given value of sp.

edf.by.sp <- function(sp) {
  model <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) + s(temp, k = 6, bs = 'cr') +
                 as.factor(dow),
               family = quasipoisson,
               na.action = na.omit,
               data = df1,
               sp = c(sp, 17.23785)  # Not sure if this is quite right.
  )
  # Absolute distance from 60, so the minimum is reached when edf is exactly 60.
  abs(summary(model)$s.table['s(time)', 'edf'] - 60)
}

Now, you can just run optim to minimize that distance:

# You could pick any reasonable starting sp value.
# Many optimization methods are available, but in your case
# they work equally well.
best <- optim(12, edf.by.sp, method = 'BFGS')$par
best
# 16.82708

Substituting back in, you get nearly 0 (an edf of almost exactly 60 before the transformation) when plugging best into the function:

edf.by.sp(best) # 2.229869e-06
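As a minimal follow-up sketch (assuming best and the df1 data frame from the question are still in the workspace), the final model could then be refit once with the optimized smoothing parameter:

# Refit once, using the optimized sp for s(time) and keeping the sp for s(temp)
# at the value reported by the original fit (17.23785).
mod.final <- gam(resp ~ s(time, bs = 'cr', k = 6*15, fx = FALSE) +
                   s(temp, k = 6, bs = 'cr') + as.factor(dow),
                 family = quasipoisson, na.action = na.omit, data = df1,
                 sp = c(best, 17.23785))
summary(mod.final)$s.table['s(time)', 'edf']  # should be very close to 60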
nograpes
  • Thanks for the response. Have you tried with the data that I have provided? – Meso Dec 15 '13 at 16:20
  • @Meso Sorry, I didn't understand that you *had* provided the data. I updated my answer, and found that my function indeed works correctly with your data. – nograpes Dec 16 '13 at 05:07
  • Many thanks for the clean code. It worked on the example data set perfectly. It also worked on my own data with slight modification. It does somewhat underestimate the basis dimension when I use some combinations of k; I think this is not due to the code but rather to some glitches in the data or model. All in all, thank you very much; you deserve more than 50 bounties, but I am a poor guy as you see! – Meso Dec 18 '13 at 16:44
3

Why use a penalized spline and then modify its smoothing parameters to create a fixed regression spline? That makes no sense to me.

A fixed df cubic regression spline with 60 edf is fitted like this:

mod1 <- gam(resp ~ s(time, bs = 'cr', k = 61, fx = TRUE) +
              s(temp, k = 6, bs = 'cr') + as.factor(dow),
            family = quasipoisson, na.action = na.omit, data = df1)

Which gives exactly what was asked for:

> summary(mod1)

Family: quasipoisson 
Link function: log 
...
Approximate significance of smooth terms:
           edf Ref.df     F p-value    
s(time) 60.000 60.000 6.511  <2e-16 ***
s(temp)  2.505  3.165 0.930   0.427    

If you want a penalized spline, then use a penalized spline and accept that the core idea of penalization is exactly that you do NOT have a fixed edf.
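For comparison, here is a minimal sketch of a purely penalized fit (assuming the df1 from the question; the method = 'REML' choice is just one option, the default GCV also works), accepting whatever EDF the penalty selects:

# Let mgcv estimate the smoothing parameters itself and report the resulting EDF.
mod.pen <- gam(resp ~ s(time, bs = 'cr', k = 6*15) +
                 s(temp, k = 6, bs = 'cr') + as.factor(dow),
               family = quasipoisson, na.action = na.omit, data = df1,
               method = "REML")
summary(mod.pen)$s.table  # the EDF for s(time) is whatever the penalty chooses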

Joris Meys
  • I am not a statistician, but I have observed that when I use a penalized spline, it sometimes doesn't adequately control for the time trend. In our discipline we control for the effect of the time trend on mortality by adjusting for it with 6 df/year (this varies from 4 to 8, etc.). Thus the consensus in some circles now is to do a number of sensitivity analyses, which include using an accepted fixed degrees of freedom per year or setting a minimum value for df when using a penalized spline. – Meso Dec 18 '13 at 16:54
  • @Meso the whole point of a penalized spline is to avoid overfitting by penalizing complexity. If you want a minimum edf, fit a penalized spline first. If the edf is smaller than your minimum, refit the model with that minimum set (a sketch of this check follows after these comments). What you do in your example is try to fit a model with *exactly* 60 edf, and that's done correctly as in my example. – Joris Meys Dec 18 '13 at 19:27
  • @Meso Next to that: using a gam for time trends doesn't take autocorrelation into account. You need a gamm for that. Plus, thin plate regression splines are arguably better for time trends, as they are not dependent on an arbitrary choice of knots. Contrary to the use of thin plate regression splines, the consensus approach does not allow for solid statistical interpretation. You might like this book: http://www.amazon.com/Generalized-Additive-Models-Introduction-Statistical/dp/1584884746 It's from the author of the mgcv package, rock solid and well written. – Joris Meys Dec 18 '13 at 19:32
  • Thanks for your comments & suggestions about gamm & thin plate regression splines. As a non-statistician, I am not the best person to comment on the overall behaviour of penalized splines, but when I used one, the time trend of mortality was not adequately controlled. As I doubted the results, I had to make sure that the trend was adjusted with a minimum of 6 df/year; in that case I can assure myself that the estimates are not due to the time trend. Check out the following publication for more details: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3764077/ – Meso Dec 21 '13 at 19:18
  • @Meso If you're not interested in the time trend, you need a random effect for that to start with. And you still need to correct for autocorrelation in the data. The problem is not so much the estimates of the effect, but the estimates of the variance around the effect. In your analysis, they're likely to be too optimistic. On a side note: PM is strongly correlated with the difference between week and weekend, the wind direction, wind speed, presence of rainfall and temperature. There's a strong seasonal trend, and you need an AR1 or CAR1 variance model for correction of autocorrelation. – Joris Meys Dec 21 '13 at 19:50
  • I worked with daily averaged data. If you take 15min measurements of the TOEM, you have residual autocorrelation with a CAR1. Simon Wood's book has some nice examples on that. I'd send you my report as well, but that was written in Dutch and was ordered by the government. So I'm alas not free to share it... – Joris Meys Dec 21 '13 at 19:55
  • Thanks again. My interest in the time trend is as a confounder of the mortality and air pollution association. I will check the PACF of the models and include AR terms if necessary. I also adjust for some of the confounders you have listed (day of the week, temperature), but not for all. – Meso Dec 21 '13 at 20:19
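A minimal sketch of the minimum-edf check described in the comments above (assuming df1 from the question; the 36-edf minimum, i.e. 6 df/year over the six years, is only an illustrative threshold):

# 1. Fit the penalized spline and let mgcv choose the smoothness.
mod.p <- gam(resp ~ s(time, bs = 'cr', k = 6*15) +
               s(temp, k = 6, bs = 'cr') + as.factor(dow),
             family = quasipoisson, na.action = na.omit, data = df1)

# 2. If the selected EDF for time falls below the agreed minimum, refit with a
#    fixed-df regression spline at that minimum (edf = k - 1 for bs = 'cr' with fx = TRUE).
min.edf <- 36  # illustrative: 6 df/year over six years
if (summary(mod.p)$s.table['s(time)', 'edf'] < min.edf) {
  mod.p <- gam(resp ~ s(time, bs = 'cr', k = min.edf + 1, fx = TRUE) +
                 s(temp, k = 6, bs = 'cr') + as.factor(dow),
               family = quasipoisson, na.action = na.omit, data = df1)
}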