I'm trying to do a linear regression but I'm only looking to use variables with positive coefficients (I think this is called hard-thresholding, but I'm not certain).

For example:

> summary(lm1)

Call:
lm(formula = value ~ ., data = intCollect1[, -c(1, 3)])

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6518  -0.2089  -0.0227   0.2035  15.2235 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.099763   0.024360   4.095 4.22e-05 ***
modelNum3802    0.208867   0.008260  25.285  < 2e-16 ***
modelNum8000   -0.086258   0.013104  -6.582 4.65e-11 ***
modelNum8001   -0.058225   0.010741  -5.421 5.95e-08 ***
modelNum8002   -0.001813   0.012087  -0.150 0.880776    
modelNum8003   -0.083646   0.011015  -7.594 3.13e-14 ***
modelNum8004    0.002521   0.010729   0.235 0.814254    
modelNum8005    0.301286   0.011314  26.630  < 2e-16 ***

In the above regression, I would only want to use models 3802, 8004 and 8005. Is there a way to do this without copying and pasting each variable name?

– screechOwl

You can check `?coef`. But if you remove variables with a negative coefficient and recompute the regression, the sign of the remaining coefficients can change. – Vincent Zoonekynd Mar 28 '12 at 22:42

3 Answers

Instead of using lm, you can formulate your problem as a quadratic program:

Minimize the sum of the squared replication errors subject to the constraint that your linear coefficients are all non-negative.

Such problems can be solved with the lsei function from the limSolve package. For your example, it would look something like this:

library(limSolve)

x.variables <- c("modelNum3802", "modelNum8000", ...)
num.var <- length(x.variables)

# minimize ||A x - B||^2 subject to G x >= H, i.e. all coefficients >= 0
lsei(A = as.matrix(intCollect1[, x.variables]),
     B = intCollect1$value,
     G = diag(num.var),
     H = rep(0, num.var))
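Note that, unlike your lm call, this fit has no intercept. If you want one, add a column of ones to A; using G = cbind(0, diag(num.var)) with the same H then constrains only the slopes and leaves the intercept free.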
– flodel

I found the nnls (non-negative least squares) package to be worth looking at.
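For reference, a minimal sketch of how that could look with the data from the question (nnls solves min ||Ax - b||^2 subject to x >= 0):

library(nnls)

# rebuild the same design matrix that lm used; note that the
# intercept column is then constrained to be non-negative as well
A <- model.matrix(value ~ ., data = intCollect1[, -c(1, 3)])
b <- intCollect1$value

fit <- nnls(A, b)
setNames(fit$x, colnames(A))  # the non-negative coefficients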

– screechOwl

You can also reparameterize your linear regression model so that positivity holds by construction, writing each coefficient as exp(alpha_i):

value ~ sum_i exp(alpha_i) * f_i

The optimization target is then

sum_j (value_j - sum_i exp(alpha_i) * f_ij)^2

over the unconstrained alpha_i. This has no closed-form solution, but the objective is smooth in the alpha_i and can be minimized with standard gradient-based optimizers (it is not convex in general, so the starting point can matter).

Once you compute the alpha_i, exponentiate them to recover the (strictly positive) coefficients of the usual linear model.
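A minimal sketch of this approach with optim, reusing the design matrix from the question's lm call (the intercept column is dropped here so that every remaining coefficient is forced positive):

# design matrix without the intercept column, and the response
A <- model.matrix(value ~ ., data = intCollect1[, -c(1, 3)])[, -1]
b <- intCollect1$value

# squared error as a function of the unconstrained alpha
sse <- function(alpha) sum((b - A %*% exp(alpha))^2)

# quasi-Newton minimization from a neutral starting point (alpha = 0)
opt <- optim(rep(0, ncol(A)), sse, method = "BFGS")

setNames(exp(opt$par), colnames(A))  # strictly positive coefficients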

– John Jiang