
I have a simple table where I am trying to work out whether my covariates (genes) are associated with cancer status in patients. Since there are a lot of covariates (~800), I am fitting a logistic regression with a LASSO penalty using glmnet(), and cross-validating it with cv.glmnet(). The first part seems to run fine, with no warnings. It is the cross-validation step that gives me these messages:

Warning messages:

1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground

2: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has fewer than 8 observations; dangerous ground

3: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold

This is a sample of the data I am using (with only 7 covariates):

> data
     Tumor     Probe_1     Probe_2    Probe_3      Probe_4    Probe_5    Probe_6     Probe_7
S_1     No -1.41509461 -3.92144111 -4.3319583 -4.894204000 -5.5379790  2.9031321  0.80587018
S_2     No -0.94584134 -2.77641045 -3.3560507 -2.211370963 -6.0006283  5.1775379  1.45389838
S_3     No -0.95188379 -3.47742475 -1.9058528 -3.019003727 -5.7203533  2.2121110  1.83080221
S_4     No -2.27462408 -3.83136845 -4.1285407 -1.691782991 -6.3683810  6.4500360  1.22882676
S_5     No -0.74983930 -2.51738976 -2.1747453 -2.279177452 -3.5778674  2.3518098  1.04400722
S_6     No -1.10189012 -3.12456412 -3.1800114 -2.567847449 -5.7474062  3.7589517  1.70868881
S_7    Yes  0.03970897 -1.98928788 -1.2119801 -0.686115233  1.0235521  0.3666321 -2.35612013
S_8    Yes  0.01597890 -1.20865821 -0.4579608 -1.192134064  1.4096178  2.4922013  0.40925359
S_9    Yes -0.27984931 -2.15706349 -2.4641827  0.047430187  1.6129360  0.5129123 -1.34833497
S_10   Yes  0.93021040 -1.97824406 -0.2918638  0.979103921 -2.5054538 -0.7654758 -2.48255982
S_11   Yes  0.83353713 -1.79506256 -2.0438707  0.460100440  0.9242979 -0.2319373 -1.51113570
S_12   Yes  0.18570649  0.05800963  0.2385482  0.433187887 -2.0097881  2.2284231  0.74761104
S_13   Yes  0.19232213 -0.95197653 -0.8496967 -0.105562938  1.0253468  0.6895510 -1.31659822
S_14   Yes  0.95731937 -1.53396032 -0.1456985  1.804472462 -3.3191177  0.2357909 -0.91231503
S_15   Yes  0.45860215 -1.36153814 -1.0998994 -0.003680416  2.0982345 -0.5042816 -1.07098039
S_16    No -0.02045748 -2.07952404 -1.5161549  1.095944357 -2.9224003  3.6426993  0.43034932
S_17    No  0.71109429 -1.19594432 -0.2472489 -0.333784895  0.7016542  0.1602559 -1.96375484
S_18    No  0.25009776 -0.98431835 -1.2113967 -0.062552222 -0.5772906  1.9909411  0.34956032
S_19    No  0.10396440 -1.43761294 -1.5490060 -0.900273908 -1.9889734  2.6280227  0.02848154
S_20    No -1.67179799 -0.69662635  0.3057564  0.497189699  1.8436791 -0.6753654 -1.74453932
S_21    No -0.33691459 -2.53752284 -2.7764968 -2.258180090  1.5861724  1.4335190  1.14224595
S_22    No -0.20888250 -3.32322098 -2.1782679  0.293379051 -5.8727867  2.3515395  1.89576377
S_23    No  0.48536983 -2.00023465 -0.8494739 -1.323411080 -6.1974792  0.2637433 -0.71707341
S_24    No  0.42733184 -2.23335363 -2.4388843  0.357150391 -2.8792254  0.4145872 -0.98182166

The Tumor column is already set as a factor:

> data$Tumor
 [1] No  No  No  No  No  No  Yes Yes Yes Yes Yes Yes Yes Yes Yes No  No  No  No  No  No  No  No  No 
Levels: No Yes

Preparing the data and running the glmnet() function:

# Build a formula from all covariate columns (everything except Tumor)
b <- paste(colnames(data)[2:ncol(data)], collapse=" + ")
b <- as.formula(paste("~ ",b))

# Design matrix of predictors, and the binary response
x <- model.matrix(b, data)

y <- data$Tumor

library("glmnet")
lasso_tumor <- glmnet(x, y, family="binomial", standardize=T, alpha=1, intercept = F)

There are no errors or warnings up to this point. But when I run cv.glmnet(), those warning messages show up:

> cv.lasso_tumor <- cv.glmnet(x, y, family="binomial", standardize=T, alpha=1, nfolds=10, parallel=TRUE, intercept=F)
Warning messages:
1: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
2: In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has fewer than 8  observations; dangerous ground
3: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold

My guess is that the Tumor group is too small (n = 9) for the validation, and because this step splits the data into folds at random, the Tumor group in each fold will be quite limited. Does that make sense? I read in this thread that this could be a problem, and that it can be handled (comment by @smci). Any idea on how this can be done?

Alternatively, would you just skip the cross-validation part and move on with only the LASSO-penalised logistic regression? In that case, what would be a sensible cut-off for lambda to find the genes (here named "probes") that are associated with my binomial classification?

Any help is much appreciated. Thanks!

1 Answer


The problem is in the CV procedure, as you already figured out. If you have few observations in a class, it can happen that, as you split your data into folds, some iterations leave fewer than 8 observations of that class in the training folds, which is "dangerous" ground for the optimization algorithm.
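
As a quick illustration (a sketch, assuming the x and y built in the question, i.e. the 24 samples shown with 9 "Yes"), you can count how many "Yes" cases remain in each training split of a random 10-fold assignment; typically some splits fall below 8, and with 24 samples each fold only holds 2-3 observations, which is also what triggers the grouped=FALSE message.

# Sketch: "Yes" cases left in each training split of a random 10-fold assignment
set.seed(1)
foldid <- sample(rep(1:10, length.out = length(y)))
sapply(1:10, function(k) sum(y[foldid != k] == "Yes"))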

As a first solution you could try reducing the number of folds from 10 to, say, 5. If that isn't enough, you could perform stratified CV by specifying the fold index of each observation (the foldid argument) and making sure that you have at least 8 observations of each class in every iteration. Otherwise LOOCV is an option, which is better but more computationally intensive.
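
For illustration only, a minimal sketch of the foldid route, assuming the x and y from the question (the seed and the choice of 5 folds are arbitrary): each class is spread evenly over the folds before the assignment is handed to cv.glmnet(). The commented lines show the LOOCV alternative, where grouped = FALSE is set explicitly because each fold holds a single observation.

# Stratified fold assignment: spread each class evenly across the folds
set.seed(42)
nfolds <- 5
foldid <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  foldid[idx] <- sample(rep(1:nfolds, length.out = length(idx)))
}

cv.lasso_tumor <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                            foldid = foldid, intercept = FALSE)

# LOOCV alternative: one fold per observation
# cv.loo <- cv.glmnet(x, y, family = "binomial", alpha = 1,
#                     nfolds = length(y), grouped = FALSE, intercept = FALSE)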

  • Thanks, @user2974951. Do you know which argument that would be? I tried reducing the number of folds to the minimum (3), but it still does not work. I was also reading further [here](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html), and the author suggests that "As for `glmnet`, we do not encourage users to extract the components directly". Does that mean that I should not use my coefficients from `glmnet` alone? – Douglas Oct 08 '19 at 12:16
  • @Douglas `foldid` is the argument which allows you to manually assign observations to folds. Stratified sampling can be done with various functions from other packages (for example caret's createFolds). For the second part, the author is merely advising you to use the package's functions to extract results from the model object, rather than trying to do so yourself. – user2974951 Oct 08 '19 at 12:38
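
A minimal sketch of the commenter's suggestion, assuming the x and y from the question and that the caret package is installed (an extra dependency here): createFolds() does stratified splitting on the outcome, its output is converted into the foldid vector cv.glmnet() expects, and coef() is the supported extractor rather than digging into the fitted object by hand.

library(caret)
set.seed(42)

# Stratified folds on the outcome; each list element holds the held-out indices
folds  <- createFolds(y, k = 5, list = TRUE)
foldid <- integer(length(y))
for (k in seq_along(folds)) foldid[folds[[k]]] <- k

cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    foldid = foldid, intercept = FALSE)

# Coefficients at the lambda that minimises the cross-validated error
coef(cv.fit, s = "lambda.min")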