I want to standardize the variables of a biological dataset. I need to fit glm, glm.nb and lm models using different response variables.
The dataset contains counts of a given tree species per plot (all the plots have the same size) and a series of categorical variables: vegetation type, soil type and presence/absence of cattle.
DATA
library(standardize)
library(AICcmodavg)
set.seed(1234)
# Short version of the dataset missing other response variables
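dat <- data.frame(Plot_ID = 1:80,
                  # size = 80 so that there is exactly one count per plot
                  Ct_tree = sample(x = 1:400, size = 80, replace = TRUE),
                  Veg = sample(x = c("Dry", "Wet", "Mixed"), size = 80, replace = TRUE),
                  Soil = sample(x = c("Clay", "Sandy", "Rocky"), size = 80, replace = TRUE),
                  # blocks of 5 "Yes" then 5 "No", repeated to cover all 80 plots
                  Cattle = rep(x = c("Yes", "No"), each = 5, times = 8))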
PROBLEM
As all the explanatory variables are categorical, I'm not sure whether it is possible to produce standardized lm models with standardized coefficients and standardized standard errors.
If I try to standardize through base R using scale(), I get an error because the explanatory variables are not numeric. I am trying to use the standardize R package, but I am not sure whether this is doing what I need.
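For reference, here is a minimal base R sketch of the scale() problem. The data frame `toy` and its columns are hypothetical and only serve to reproduce the error; the last step shows the 0/1 dummy columns that a model builds from a categorical predictor, which is why I am unsure what "standardizing" such a variable should mean.

```r
# Hypothetical toy data: one numeric and one categorical column.
toy <- data.frame(
  y   = c(12, 30, 7, 25, 18, 41),
  Veg = c("Dry", "Wet", "Mixed", "Dry", "Wet", "Mixed")
)

# scale() works on the numeric column: the result has mean 0 and sd 1.
y_scaled <- as.numeric(scale(toy$y))

# scale() on the categorical column fails; capture the message instead of stopping.
veg_error <- tryCatch(
  {
    scale(toy$Veg)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(veg_error)

# What a model sees for a categorical predictor under the default
# treatment contrasts: an intercept plus 0/1 dummy columns.
X <- model.matrix(~ Veg, data = toy)
print(X)
```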
MODELS
m1 <- standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian", scale = 1)
# Error in standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian": no variables in formula
m2 <- standardize(formula = Ct_tree ~ Veg, data = dat, family = "gaussian", scale = 1)
m3 <- standardize(formula = Ct_tree ~ Soil, data = dat, family = "gaussian", scale = 1)
m4 <- standardize(formula = Ct_tree ~ Cattle, data = dat, family = "gaussian", scale = 1)
m5 <- standardize(formula = Ct_tree ~ Veg + Soil, data = dat, family = "gaussian", scale = 1)
m6 <- standardize(formula = Ct_tree ~ Veg + Cattle, data = dat, family = "gaussian", scale = 1)
m7 <- standardize(formula = Ct_tree ~ Soil + Cattle, data = dat, family = "gaussian", scale = 1)
m8 <- standardize(formula = Ct_tree ~ Veg + Soil + Cattle, data = dat, family = "gaussian", scale = 1)
# m1_st <- lm(formula = m1$formula, data = m1$data)  # cannot be run because m1 fails above
m2_st <- lm(formula = m2$formula, data = m2$data)
# [...]
m8_st <- lm(formula = m8$formula, data = m8$data)
# Produce a summary table of AICs
models <- list(Veg = m2_st, Soil = m3_st, Cattle = m4_st, VegSoil = m5_st, VegCattle = m6_st, SoilCattle = m7_st, VegSoilCattle = m8_st)
aic_tbl <- aictab(models, second.ord = TRUE, sort = TRUE)
QUESTIONS
1) Am I implementing the standardize package correctly?
2) Is my code doing the standardization that I am after?
3) When I inspect the data element of any of the models (e.g. m2$data), it looks like the response variable (Ct_tree) has been standardized. Is this what is supposed to happen? I thought that the standardization would be applied to the explanatory variables, not the response.
4) How can I standardize the intercept-only model (Ct_tree ~ 1)? Maybe it does not need to be standardized, but I still need it in the final AIC table to compare all the models.
5) I also have other response variables that are absence/presence (recoded as 0 and 1 respectively). Is it statistically correct to also standardize these columns using the same process as above? The standardize package produces a presence/absence column identical to the original. However, if I rescale such a column with the base R function scale(), the resulting values are positive and negative decimals, and I cannot fit a model with a binomial family.
6) If I recode the qualitative explanatory variables as ordinal (e.g. Soil = 0 for clay, 1 for sandy, 2 for rocky), and then scale them, would that be statistically correct?
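To make questions 5 and 6 concrete, here is a minimal base R sketch on hypothetical toy data (`toy`, `Present` and the other names below are made up for illustration and are not my real dataset). It shows the failure I get after scaling a 0/1 response, and the ordinal recoding I have in mind for Soil.

```r
# Hypothetical toy data: a 0/1 response and a three-level soil factor.
toy <- data.frame(
  Present = rep(c(0, 1), times = c(8, 12)),
  Soil    = rep(c("Clay", "Sandy", "Rocky"), length.out = 20)
)

# Question 5: scale() turns the 0/1 column into two values,
# one negative and one positive, neither of them 0 or 1 ...
present_scaled <- as.numeric(scale(toy$Present))
print(unique(round(present_scaled, 3)))

# ... and a binomial glm() then refuses the rescaled response.
binom_error <- tryCatch(
  {
    glm(present_scaled ~ 1, family = binomial)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(binom_error)

# Question 6: the recoding I am asking about
# (Clay = 0, Sandy = 1, Rocky = 2), followed by scale().
soil_num    <- as.numeric(factor(toy$Soil, levels = c("Clay", "Sandy", "Rocky"))) - 1
soil_scaled <- as.numeric(scale(soil_num))

# The recoded version uses a single slope for Soil,
# whereas the factor version estimates one coefficient per non-reference level.
fit_ordinal <- glm(Present ~ soil_scaled, family = binomial, data = toy)
fit_factor  <- glm(Present ~ Soil, family = binomial, data = toy)
n_coef_ordinal <- length(coef(fit_ordinal))
n_coef_factor  <- length(coef(fit_factor))
print(c(ordinal = n_coef_ordinal, factor = n_coef_factor))
```

The two fits differ in how many coefficients they spend on Soil, which is part of why I am unsure whether the recoding is legitimate.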