"The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" (p. 394).
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16(4), 385–395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf
H2O:
Like package 'glmnet', the h2o.glm function includes a 'standardize' parameter that is true by default. However, if predictors are stored as factors in the input H2OFrame, H2O does not appear to standardize the automatically encoded factor variables (i.e., the resulting dummy or one-hot columns). I've confirmed this experimentally (a sketch of the check appears below), but references to this decision also show up in the source code:
For instance, the denormalizeBeta method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." Likewise, in the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599), means (variable _normSub) and standard deviations (the inverse of variable _normMul) are only calculated for the numeric variables, not the categorical ones.
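A minimal sketch of that experimental check (assuming a local H2O cluster; the toy data and column names are invented for illustration):

```r
library(h2o)
h2o.init()

set.seed(1)
df <- data.frame(
  x_cont = rnorm(200),
  x_cat  = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  y      = rnorm(200)
)
hf <- as.h2o(df)

fit <- h2o.glm(x = c("x_cont", "x_cat"), y = "y", training_frame = hf,
               lambda = 0, standardize = TRUE)

# If the dummies were standardized, their coefficients would differ between
# the raw and standardized tables (as x_cont's does, by a factor of its SD);
# instead, the x_cat level coefficients come out identical in both.
h2o.coef(fit)
h2o.coef_norm(fit)
```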
glmnet:
In contrast, package 'glmnet' seems to expect categorical variables to be dummy-coded before the model is fit, using a function like model.matrix; the dummy columns are then standardized internally along with the continuous ones. It seems like the only way to avoid this would be to pre-standardize the continuous predictors only, concatenate them with the unstandardized dummy columns, and run glmnet with standardize=FALSE, as sketched below.
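A minimal sketch of both routes (toy data and column names are mine):

```r
library(glmnet)

set.seed(1)
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  f  = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  y  = rnorm(100)
)

# Usual route: dummy-code, then let glmnet standardize every column,
# dummy columns included (standardize = TRUE is the default).
X <- model.matrix(y ~ x1 + x2 + f, df)[, -1]  # drop the intercept column
fit_default <- glmnet(X, df$y)

# Workaround: standardize only the continuous columns, leave the dummies
# as 0/1, and turn off glmnet's internal standardization.
X_mixed <- X
X_mixed[, c("x1", "x2")] <- scale(X_mixed[, c("x1", "x2")])
fit_mixed <- glmnet(X_mixed, df$y, standardize = FALSE)
```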
Statistical Considerations:
For a dummy variable or one-hot vector with a proportion p of TRUE values, the mean is p and the population SD is σ = √(p(1 - p)). The SD reaches its maximum of 0.5 when the TRUE and FALSE values are equally frequent (p = 0.5), and the sample SD (s) converges to √(p(1 - p)) as n → ∞. Thus, if continuous predictors are standardized to have SD = 1, but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD for class-imbalanced dummy variables.
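A quick numeric check of how fast the dummy SD falls below 0.5 under class imbalance:

```r
p <- c(0.50, 0.25, 0.10, 0.01)   # proportion of TRUE values
round(sqrt(p * (1 - p)), 3)      # population SD of a Bernoulli(p) dummy
#> [1] 0.500 0.433 0.300 0.099
```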
It seems like this could be a problem for regularization (lasso, ridge, elastic net), because these methods penalize the raw coefficients, so the penalty (λ) only applies evenly when the predictors are on a common scale. If two predictors A and B have the same standardized effect size, but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B, and will therefore absorb more of the penalty. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, it seems like this could lead to systematic over-penalization of the categorical predictors.
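A small simulation sketch of the effect (exact numbers will vary with the noise): a continuous predictor and a balanced dummy are given identical standardized effects, yet the lasso shrinks the dummy's effect much harder when nothing is standardized.

```r
library(glmnet)

set.seed(1)
n <- 5000
x_cont <- rnorm(n)            # SD = 1, i.e., already "standardized"
x_dum  <- rbinom(n, 1, 0.5)   # balanced dummy, SD ~ 0.5, unstandardized

# Equal standardized effects: a 1-SD increase in either predictor shifts
# y by 1, so the dummy's raw coefficient must be 1 / 0.5 = 2.
y <- 1 * x_cont + 2 * x_dum + rnorm(n)

fit <- glmnet(cbind(x_cont, x_dum), y, standardize = FALSE)

# At a moderate lambda, x_cont retains roughly 60% of its true coefficient
# while x_dum retains only about 20% of its: the smaller-SD predictor's
# larger raw coefficient absorbs more of the penalty.
coef(fit, s = 0.4)
```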
A commonly expressed concern is that standardizing dummy variables destroys their usual interpretation. To avoid this issue while still placing continuous and categorical predictors on an equal footing, Gelman (2008) suggested standardizing continuous predictors by dividing by 2 SD rather than 1, so that they end up with SD = 0.5, the SD of a perfectly balanced dummy. However, it seems like this would still be biased for class-imbalanced dummy variables, whose SD can be substantially less than 0.5 (sketched below).
Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865–2873. http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
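A sketch of the 2-SD rescaling and of the residual imbalance problem (the helper name rescale_2sd is mine):

```r
# Gelman (2008)-style rescaling: center continuous inputs and divide by
# 2 SD, so they end up with SD = 0.5, comparable to a balanced dummy.
rescale_2sd <- function(x) (x - mean(x)) / (2 * sd(x))

x <- rnorm(1e4, mean = 10, sd = 3)
sd(rescale_2sd(x))        # = 0.5

# But an imbalanced dummy still falls short of 0.5, so the scales
# remain mismatched:
d <- rbinom(1e4, 1, 0.05)
sd(d)                     # ~ sqrt(0.05 * 0.95) = 0.218
```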
Question:
Is H2O's approach of not standardizing one-hot vectors correct for regularized regression? Could it lead to a bias toward over-penalizing dummy variables / one-hot vectors? Or has Tibshirani's (1997) recommendation since been revised for some reason?