
I have been trying to create a Randomized Lasso function in R, but it does not generate the same results as the Python sklearn randomized lasso function. I am applying the same philosophy here but cannot see where the difference comes from. The code was adapted from this code: randomized lasso function in R.

Here are the code and sample data:

# generate synthetic data
set.seed(100)
size = 750
x = matrix(runif(14*size),ncol=14)
y = 10 * sin(pi*x[,1]*x[,2]) + 20*(x[,3]-0.5)^2 + 10*x[,4] + 5*x[,5] + runif(1,0,1)
nbootstrap = 200
nsteps = 20
alpha = 0.2

dimx <- dim(x)
n <- dimx[1]
p <- dimx[2]
halfsize <- as.integer(n/2)
freq <- matrix(0,1,p)

for (i in seq(nbootstrap)) {

  # Randomly reweight each variable
  xs <- t(t(x)*runif(p,alpha,1))

  # Randomly split the sample in two sets
  perm <- sample(dimx[1])
  i1 <- perm[1:halfsize]
  i2 <- perm[(halfsize+1):n]

  # run the randomized lasso on each sample and check which variables are selected
  cv_lasso <- lars::cv.lars(xs[i1,],y[i1],plot.it=FALSE, mode = 'step')
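  # which.max on a logical vector returns the first TRUE: the first step whose
  # CV error minus one standard error falls at or below the minimum CV error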
  idx <- which.max(cv_lasso$cv - cv_lasso$cv.error <= min(cv_lasso$cv))
  coef.lasso <- coef(lars::lars(xs[i1,],y[i1]))[idx,]
  freq <- freq + abs(sign(coef.lasso))

  cv_lasso <- lars::cv.lars(xs[i2,],y[i2],plot.it=FALSE, mode = 'step')
  idx <- which.max(cv_lasso$cv - cv_lasso$cv.error <= min(cv_lasso$cv))
  coef.lasso <- coef(lars::lars(xs[i2,],y[i2]))[idx,]
  freq <- freq + abs(sign(coef.lasso))
  print(freq)
}

# normalize frequency to [0,1]
freq <- freq/(2*nbootstrap)

The results should be comparable to those shown in this table: stability in python. However, neither this approach nor the original R code from the first hyperlink finds the correlated features X11 to X14. I am not sure which part of my R code is not working properly.
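For reference, this is the small snippet I use to inspect the result (a sketch assuming the freq and p defined above; the X1 to X14 labels are just positional names I assign):

# name each normalized frequency by its column position and sort in decreasing order
sort(setNames(as.vector(freq), paste0("X", 1:p)), decreasing = TRUE)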

Kuang Liang

1 Answer


First, thank you for posting this question; I enjoyed learning about stability selection while looking over your code and references. Second, you will probably kick yourself when you see this answer. I think your code is valid, but you forgot to create the strongly correlated features of the "Friedman #1" regression dataset. The Python code from your second link is as follows:

#"Friedamn #1” regression problem
Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 +
     10*X[:,3] + 5*X[:,4] + np.random.normal(0,1))
# Add 4 additional correlated variables (correlated with X1-X4)
X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4))

Your code does not include this second step. Therefore, all but the first five features are noise, and they are correctly excluded by your stability selection algorithm.
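In R, the missing step would look something like this (a minimal sketch, assuming the x and size defined in your question; note that Python's zero-based X[:,10:] corresponds to columns 11 through 14 in R):

# add 4 correlated variables by copying the first 4 columns and adding small Gaussian noise
x[, 11:14] <- x[, 1:4] + matrix(rnorm(size * 4, mean = 0, sd = 0.025), nrow = size, ncol = 4)

With this step in place, X11 to X14 become near-duplicates of X1 to X4, which is exactly the kind of correlated structure the stability table is measuring.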

Ian Wesley