
I'd like to fit a nonlinear regression for dimensionality reduction on a dataset that has more predictors than observations, where the predictors can also be multicollinear [edit: it is similar to a gene expression data set]. From googling, I found that a GAM with smooth functions plus an L1 penalty could do the job. However, when I try to implement such a model with the R package mgcv, I get this error early on: "model has more coefficients than data".

After reading the answer to this question, I assume that I cannot fit a GAM with more predictors than observations using mgcv. Can someone point me to a package that is suitable for this, or tell me whether I have made a mistake in my code?

Here is example code showing what I have tried; it gives the same error. Note that my "real" dataset has p > n [edit: and all variables are numeric].

library(mgcv)
set.seed(2) 
dat <- gamSim(7, n=40, scale=2) #get some example data
colnames(dat) 
#"y"  "x0" "x1" "x2" "x3" "f"  "f0" "f1" "f2" "f3"
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(f) + s(f0) + s(f1) + s(f2),
         data = dat, select = TRUE) # select = TRUE lets terms be penalized to zero
summary(b)
#error: model has more coefficients than data
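
As a rough check of where the error in this example comes from, the coefficient count can be worked out by hand. This is only a sketch, assuming mgcv's default basis dimension of k = 10 for a one-dimensional s() term, of which one coefficient is lost to the identifiability constraint. The names n_obs, n_smooths, k_default and b_small are made up for illustration. Lowering k makes this toy example fit, but it does not address a real p > n dataset, where even one coefficient per predictor would exceed the number of observations.

```r
library(mgcv)

# Coefficient count for the model above (assumption: default k = 10 per s() term,
# minus 1 for the sum-to-zero constraint, leaving 9 coefficients per smooth)
n_obs     <- 40
n_smooths <- 8
k_default <- 10
n_coef    <- 1 + n_smooths * (k_default - 1)  # intercept + smooth coefficients
n_coef          # 73
n_coef > n_obs  # TRUE, which is why gam() refuses to fit

# With a smaller basis per term the count drops below n and gam() can proceed
set.seed(2)
dat <- gamSim(7, n = 40, scale = 2)
b_small <- gam(y ~ s(x0, k = 4) + s(x1, k = 4) + s(x2, k = 4) + s(x3, k = 4) +
                 s(f, k = 4) + s(f0, k = 4) + s(f1, k = 4) + s(f2, k = 4),
               data = dat, select = TRUE)
length(coef(b_small))  # intercept + 8 * 3 coefficients
```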
midas
  • Have a look at the fused lasso or grouped lasso models for dealing with groups of predictors; you'll need to create the spline bases yourself and form the model matrix you need, but those methods should work in a roughly similar way. Also, consider trend filtering; the only caveat is that I don't know if it handles the p>n case but as a method it comes from the same world as the fused lasso IIRC. – Gavin Simpson Feb 05 '21 at 16:44
  • Thank you, this is very helpful - I had not considered this before and it looks promising! – midas Feb 05 '21 at 18:51
  • @GavinSimpson I had a look at fused lasso (specifically at the package `msgl`, which has a good vignette), but as far as I understand it, these methods only work when you have a grouping factor. My response and predictors are all numeric (I forgot to specify this in my question, will edit it now), and I don't know which of my predictors group together - the data is comparable to a gene expression data set. – midas Feb 08 '21 at 14:51
  • This question may be better suited for https://stats.stackexchange.com/, where they are more interested in statistics and algorithms than in code. – Adam Sampson Feb 08 '21 at 15:07
  • Yes, it seems it fits better there now that I know it is not a mistake in the code; I will post it there. Thank you for your answers! – midas Feb 08 '21 at 15:18
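
The suggestion in the first comment (build the spline bases yourself, then apply a grouped lasso) could be sketched roughly as follows. This is a minimal illustration, not a tested recipe for real data: it assumes the grpreg package (a different package from the msgl one mentioned above) and uses simulated data with made-up names (X, y, df_per_var, B, group). In this setup no external grouping factor is needed, because each group is simply the set of basis columns generated from one numeric predictor, so a whole predictor is either kept or dropped.

```r
library(splines)
library(grpreg)  # assumption: grpreg is installed; it is not named in the thread

set.seed(1)
n <- 40   # observations
p <- 60   # predictors, so p > n
X <- matrix(rnorm(n * p), n, p)
y <- sin(X[, 1]) + X[, 2]^2 + rnorm(n, sd = 0.3)  # simulated response

# Expand every predictor into a small B-spline basis
df_per_var <- 4
B <- do.call(cbind, lapply(seq_len(p), function(j) bs(X[, j], df = df_per_var)))
colnames(B) <- paste0("x", rep(seq_len(p), each = df_per_var),
                      "_b", rep(seq_len(df_per_var), times = p))

# One group per original predictor: its df_per_var basis columns
group <- rep(seq_len(p), each = df_per_var)

# Group lasso with cross-validated penalty strength
cvfit <- cv.grpreg(B, y, group = group, penalty = "grLasso")

# Predictors whose basis coefficients are not all zero at the selected lambda
beta     <- coef(cvfit)[-1]  # drop the intercept
selected <- unique(group[beta != 0])
selected
```

Which predictors end up in selected depends on the cross-validation folds and the data, so treat the output as exploratory.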

0 Answers