how to use the Box-Cox power transformation in R

Question

I need to transform some data into a 'normal shape' and I read that Box-Cox can identify the exponent to use to transform the data.

For what I understood

car::boxCoxVariable(y)

is used for response variables in linear models, and

MASS::boxcox(object)

for a formula or fitted model object. So, because my data are the variable of a dataframe, the only function I found I could use is:

car::powerTransform(dataframe$variable, family="bcPower")

Is that correct? Or am I missing something?

The second question is about what to do after I obtain the

Estimated transformation parameters
dataframe$variable
0.6394806

Should I simply multiply the variable by this value? I did so:

aaa = 0.6394806
dataframe$variable2 = (dataframe$variable)*aaa

and then I run the shapiro-wilks test for normality, but again my data don't seem to follow a normal distribution:

shapiro.test(dataframe$variable2)
data:  dataframe$variable2
W = 0.97508, p-value < 2.2e-16

I found to be a good documentation with clear R code and examples (and also for other transformations) the chapter [Transforming Data](https://rcompanion.org/handbook/I_12.html) in the handbook [Summary and Analysis of Extension Program Evaluation in R](https://rcompanion.org/handbook/) — Valentin_Ștefan, Aug 11 '18 at 20:38
@Valentin very good explanation provided in the book mentioned. Many Thanks! — Rohit parihar, May 02 '20 at 14:32

mlegge · Answer 1 · 2017-06-02T14:48:55.733

Box and Cox (1964) suggested a family of transformations designed to reduce nonnormality of the errors in a linear model. In turns out that in doing this, it often reduces non-linearity as well.

Here is a nice summary of the original work and all the work that's been done since: http://www.ime.usp.br/~abe/lista/pdfm9cJKUmFZp.pdf

You will notice, however, that the log-likelihood function governing the selection of the lambda power transform is dependent on the residual sum of squares of an underlying model (no LaTeX on SO -- see the reference), so no transformation can be applied without a model.

A typical application is as follows:

library(MASS)

# generate some data
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 + rnorm(n)

# run a linear model
m <- lm(y ~ x)

# run the box-cox transformation
bc <- boxcox(y ~ x)

(lambda <- bc$x[which.max(bc$y)])
[1] 0.4242424

powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {

  boxcoxTrans <- function(x, lam1, lam2 = NULL) {

    # if we set lambda2 to zero, it becomes the one parameter transformation
    lam2 <- ifelse(is.null(lam2), 0, lam2)

    if (lam1 == 0L) {
      log(y + lam2)
    } else {
      (((y + lam2)^lam1) - 1) / lam1
    }
  }

  switch(method
         , boxcox = boxcoxTrans(y, lambda1, lambda2)
         , tukey = y^lambda1
  )
}


# re-run with transformation
mnew <- lm(powerTransform(y, lambda) ~ x)

# QQ-plot
op <- par(pty = "s", mfrow = c(1, 2))
qqnorm(m$residuals); qqline(m$residuals)
qqnorm(mnew$residuals); qqline(mnew$residuals)
par(op)

As you can see this is no magic bullet -- only some data can be effectively transformed (usually a lambda less than -2 or greater than 2 is a sign you should not be using the method). As with any statistical method, use with caution before implementing.

To use the two parameter Box-Cox transformation, use the geoR package to find the lambdas:

library("geoR")
bc2 <- boxcoxfit(x, y, lambda2 = TRUE)

lambda1 <- bc2$lambda[1]
lambda2 <- bc2$lambda[2]

EDITS: Conflation of Tukey and Box-Cox implementation as pointed out by @Yui-Shiuan fixed.

You might point out that it's possible to get the same answer using `MASS` by using the model `lm(y ~ 1)` (in this case, `bc <- boxcox(variable ~ 1, data=dataframe)`). `powerTransform()` gives the "right" lambda, but the data have things going on that make it impossible to force normality using just Box-Cox. — Sam Dickson, Nov 30 '15 at 16:17
Great answer! May I ask why you emphasized "errors" in the beginning of your answer? Is this because one is to transform the response variable (e.g. y in y ~ x_1 + x_2) and not the covariates (x_1 or x_2), or can one transform the covariates as well? — Helen, Jan 09 '20 at 14:32

Sean Yun-Shiuan Chuang · Answer 2 · 2017-05-24T15:48:10.357

According to the Box-cox transformation formula in the paper Box,George E. P.; Cox,D.R.(1964). "An analysis of transformations", I think mlegge's post might need to be slightly edited.The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (Actually, y^(lambda) is called Tukey transformation, which is another distinct transformation formula.)
So, the code should be:

(trans <- bc$x[which.max(bc$y)])
[1] 0.4242424
# re-run with transformation
mnew <- lm(((y^trans-1)/trans) ~ x) # Instead of mnew <- lm(y^trans ~ x)

More information

Correct implementation of Box-Cox transformation formula by boxcox() in R:
https://www.r-bloggers.com/on-box-cox-transform-in-regression-models/
A great comparison between Box-Cox transformation and Tukey transformation. http://onlinestatbook.com/2/transformations/box-cox.html
One could also find the Box-Cox transformation formula on Wikipedia: en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation

Please correct me if I misunderstood it.

Thank you for pointing this out (with excellent documentation!). I've updated my answer to try to address this — mlegge, Jun 02 '17 at 14:55

score 5 · Answer 3 · answered Jun 08 '20 at 21:52

5

If I want tranfer only the response variable y instead of a linear model with x specified, eg I wanna transfer/normalize a list of data, I can take 1 for x, then the object becomes a linear model:

library(MASS)
y = rf(500,30,30)
hist(y,breaks = 12)
result = boxcox(y~1, lambda = seq(-5,5,0.5))
mylambda = result$x[which.max(result$y)]
mylambda
y2 = (y^mylambda-1)/mylambda
hist(y2)

answered Jun 08 '20 at 21:52

Hongzhen Tian

51
1
2

had 4 different variables with all 4 histograms showing nonnormal distribution, this solution helped me to bring them to a normal distribution individually. – Arvind Reddy May 13 '21 at 17:05

score 1 · Answer 4 · answered Apr 01 '20 at 10:58

Applying the BoxCox transformation to data, without the need of any underlying model, can be done currently using the package geoR. Specifically, you can use the function boxcoxfit() for finding the best parameter and then predict the transformed variables using the function BCtransform().

how to use the Box-Cox power transformation in R

4 Answers4

More information

Linked