Standardizing qualitative variables in R to perform glm's, glm.nb's and lm's

Question

I want to standardize the variables of a biological dataset. I need to run glm's, glm.nb's and lm's using different response variables.

The dataset contains counts of a given tree species by plots (all the the plots have the same size) and a series of qualitative variables: vegetation type, soil type and presence/absence of cattle.

DATA

library(standardize)
library(AICcmodavg)

set.seed(1234)
# Short version of the dataset missing other response variables
dat <- data.frame(Plot_ID = 1:80,
                  Ct_tree = sample(x = 1:400, replace = T),
                  Veg = sample(x = c("Dry", "Wet", "Mixed"), size = 80, replace = T),
                  Soil = sample(x = c("Clay", "Sandy", "Rocky"), size = 80, replace = T),
                  Cattle = rep(x = c("Yes", "No"), each = 5))

PROBLEM

As all the explanatory variables are categorical, I'm not sure whether it is possible to produce standardized lm models with standardized coefficients and standardized standard errors.

If I try to standardize through base R using scale(), I get an error because the explanatory variables are not numeric. I am trying to use the standardize R package, but I am not sure whether this is doing what I need.

MODELS

m1 <- standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian", scale = 1)
# Error in standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian": no variables in formula
m2 <- standardize(formula = Ct_tree ~ Veg, data = dat, family = "gaussian", scale = 1)
m3 <- standardize(formula = Ct_tree ~ Soil, data = dat, family = "gaussian", scale = 1)
m4 <- standardize(formula = Ct_tree ~ Cattle, data = dat, family = "gaussian", scale = 1)
m5 <- standardize(formula = Ct_tree ~ Veg + Soil, data = dat, family = "gaussian", scale = 1)
m6 <- standardize(formula = Ct_tree ~ Veg + Cattle, data = dat, family = "gaussian", scale = 1)
m7 <- standardize(formula = Ct_tree ~ Soil + Cattle, data = dat, family = "gaussian", scale = 1)
m8 <- standardize(formula = Ct_tree ~ Veg + Soil + Cattle, data = dat, family = "gaussian", scale = 1)

# m1_st <- standardize(formula = m1$formula, data = m1$data)
m2_st <- lm(formula = m2$formula, data = m2$data)
# [...] 
m8_st <- lm(formula = m8$formula, data = m8$data)

# Produce a summary table of AICs
models <- list(Veg = m2_st, Soil = m3_st, Cattle = m4_st, VegSoil = m5_st, VegCattle = m6_st, SoilCattle = m7_st, VegSoilCattle = m8_st)
aic_tbl <- aictab(models, second.ord = TRUE, sort = TRUE)

QUESTIONS

1) Am I implementing the standardize package correctly?

2) Is my code doing the standardization that I am after?

3) When I call mi$data, it looks like the response variable (Ct_tree) has been standardized. Is this what it is supposed to happen? I thought that the standardization would happen to the explanatory variables, not the response.

4) How can I standardize the intercept (Ct_tree ~ 1)? Maybe it does not need to be standardized, but I still need it in the final AIC table to compare all the models.

5) I also have other response variables that are absence/presence (recoded as 0 and 1 respectively). Is it statistically correct to also standardize these columns using the same process as above? The standardize package produces a presence/absence column identical to the original. However, if I rescale such column by means of the function scale() from base R, the numbers produced are positive and negative, with decimals, and I cannot apply a binomial family.

6) If I recode the qualitative explanatory variables as ordinal (e.g. Soil = 0 for clay, 1 for sandy, 2 for rocky), and then scale them, would that be statistically correct?

Sounds to me most of your questions would probably be better addressed on https://stats.stackexchange.com/ . — niko, Dec 28 '18 at 09:00
I appreciate your suggestion @nate.edwinton! I will explore that opcion. — Darius, Dec 29 '18 at 11:13

score 7 · Accepted Answer · answered Dec 28 '18 at 13:12

My answer could be opinionated. In addition, I am a biologist and not a mathematician, so somebody with better mathematical background could give a more reasoned answer.

The first question is why do we need standardization? Basically, we use it in order to compare the effect size of different predictors. Let's say we want to estimate how plant mass (M) depends on the concentration of nitrogen in soil (N) and availability of water (W). There will be two predictors with different units and different amplitude. It is very important that both predictors are continuous. We can estimate regression coefficients from raw data. Let's assume that final biomass could be represented as

M = 0.1 * N + 0.2 * W + error

So, which factor is more important? Of course, we can not deduce that only from those coefficients. For comparison, we need to take into account units and variability of factors. Therefore reporting coefficients only, could be insufficient for understanding your foundings. The standardization could be a solution for such a situation.

Now let's assume that we got the same regression coefficients, but predictors were previously standardized. In such a situation it is clear that the plant mass changes by 0.1 when the nitrogen concentration changes by 1 standard deviation unit. The same is for water (0.2 mass unit per 1 sd unit). If your experiment incorporated wide ranges of water and nitrogen conditions, you can suggest that water is twice more important then nitrogen. So standardization is useful for comparing the effects of continuous predictors.

In your case, predictors are categorical i.e. factors. And your initial question is "does trees number differ in different condition groups?". Here your result will be an amount of difference. For example, there are on average 50 more trees per plot on clay soil than on sandy soil. It is quite a clear result. If some condition is responsible for a higher change in trees number it has a stronger effect. So it seems that standardization is not required.

Still, you can ask an additional question "Is the difference of 50 trees is a high difference?". If the average number of trees is 10000, the addition of 50 trees is neglectable. But if there are on average 100 trees per plot, the changes are really big. In order to deal with such a problem, you can standardize your response variable. Thus, you will get the difference in standard deviation units (something like Cohen's d).

In any case, the choice to standardize or not to standardize should be your decision, based on your expertise in the field. If standardization will help you to interpret your results, then do it. If you think that the difference in standard deviation units is more illustrative and will be more understandable for your readers then do it. As for me, I'd suggest staying with raw values but presenting results in relative units (%). For example, there are 15% more trees on clay than on sand soils. But again it should be your decision.

As a conclusion:

You do not need standardization for categorical variables. The same goes for binary predictors.
You can standardize your response variable if you think that the difference in SD units is more convincing. For this base::scale will be enough.
Replacing categorical variables with numbers (and treating them as numbers) is not a good idea in your case.

P.S. sorry for bad grammar.

Tha was extremely useful @Istrel. I've always been inclined to use raw values, when possible, in my analysis. — Darius, Dec 31 '18 at 06:07

Standardizing qualitative variables in R to perform glm's, glm.nb's and lm's

1 Answers1