1

I´m trying to create a function to (visually) compare the distribution of a variable, with that of the same variable after a Box-Cox transformation. The variable is a single column pulled out of my entire data frame.

library(EnvStats)

bc_compare_1 <- function(var){
  bc_var <- boxcox(lm(var ~ 1))
  lambda <- bc_var$x[which.max(bc_var$y)]
  var_T <- (var^lambda - 1)/lambda
  g <- ggarrange(
    ggdensity(var, fill = "grey", alpha = 0.3) + 
      geom_histogram(colour = 1, fill = "white", 
                     position = "identity", alpha = 0) + 
      ggtitle("original") + 
      theme(plot.title = element_text(size = 11)),
    ggdensity(var_T, fill = "grey", alpha = 0.3) + 
      geom_histogram(colour = 1, fill = "white", 
                     position = "identity", alpha = 0) +  
      ggtitle("transformed") + 
      theme(plot.title = element_text(size = 11))) 
  g <- annotate_figure(g, top = text_grob(substring(deparse(substitute(var)),3), size = 11))
  l <- list(g, paste("lambda = ", lambda))
  return(l)
}

This unfortunately doesn´t work:

Error in model.frame.default(formula = var ~ 1, drop.unused.levels = TRUE) : 
object is not a matrix

I tried several things, but nothing works, and it seems that the problem is somehow with boxcox() not being able to deal with a linear model which was created within the function, cause I get the same error even in the simple example:

library(EnvStats)

testt <- function(var){
  boxcox(lm(var ~ 1))
}

edit: trying to include the data argument in the lm() function also didn´t seem to work:

testt <- function(data, var){
  data %>%
    pull(var) -> dvar
  lmvar <- lm(data = data, formula = dvar ~ 1)
  boxcox(lmvar)
}

-> also no good:

Error in model.frame.default(formula = (data %>% pull(var)) ~ 1, data = data, : 
'data' must be a data.frame, environment, or list

(the data is a dataframe)

Any ideas?

Thanks a lot in advance!

Guy

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
guy
  • 23
  • 3
  • When using a non-base function such as `boxcox`, start the scripts with calls to `library()` to load the required packages. Is it package `EnvStats`? If so, edit the question with that information, please. – Rui Barradas Aug 26 '22 at 14:40
  • 2
    From the documentation, my emphasis: *When x is an object of class "lm", the object must have been created with a call to the function lm that includes **the data argument.*** – Rui Barradas Aug 26 '22 at 14:43
  • hmm, tried it, but it didn´t seem to work - or did I not understand you correctly? (I just added the data as an argument to the function and added it in as an argument to the lm() function?) – guy Aug 26 '22 at 16:29
  • Aren't calling `boxcox` to compute a optimal value for lambda? – Rui Barradas Aug 26 '22 at 18:28

1 Answers1

0

Here is a solution.
The following function uses reformulate to put the formula's pieces together and corrects the lm call, apparently, boxcox needs the model.matrix x and the response y and these values default to FALSE, see the documentation for lm, section Arguments.

The column name var can be quoted or not.

suppressPackageStartupMessages(
  library(EnvStats)
)

testt <- function(data, var){
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar)
}

testt(iris, "Sepal.Length")
#> $lambda
#> [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
#> 
#> $objective
#> [1] 0.9839965 0.9881195 0.9910533 0.9927322 0.9931009 0.9921171 0.9897540
#> [8] 0.9860027 0.9808736
#> 
#> $objective.name
#> [1] "PPCC"
#> 
#> $optimize
#> [1] FALSE
#> 
#> $optimize.bounds
#> lower upper 
#>    NA    NA 
#> 
#> $eps
#> [1] 2.220446e-16
#> 
#> $lm.obj
#> 
#> Call:
#> lm(formula = fmla, data = data, x = TRUE, y = TRUE)
#> 
#> Coefficients:
#> (Intercept)  
#>       5.843  
#> 
#> 
#> $sample.size
#> [1] 150
#> 
#> $data.name
#> [1] "lmvar"
#> 
#> attr(,"class")
#> [1] "boxcoxLm"

# same output, omitted
testt(iris, Sepal.Length)

Created on 2022-08-26 by the reprex package (v2.0.1)


Edit

An argument optimize could be useful whenever the optimal lambda is needed.

testt <- function(data, var, optimize = FALSE){
  var <- as.character(substitute(var))
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar, optimize = optimize)
}

testt(iris, Sepal.Length)$lambda
#[1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0

testt(iris, Sepal.Length, optimize = TRUE)$lambda
#[1] -0.1117881
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Amazing, thank you so much! Didn´t even need to reformulate, just added x=T, y=T and it worked. Thanks! – guy Aug 26 '22 at 20:46
  • @guy Glad it helped. And, I had thought of adding an argument `optimize` defaulting to FALSE. Whenever the optimal lambda is needed you could pass `optimize=TRUE`. I will edit the answer with this function version at the end. – Rui Barradas Aug 26 '22 at 20:49