40

I need to transform some data into a 'normal shape' and I read that Box-Cox can identify the exponent to use to transform the data.

For what I understood

car::boxCoxVariable(y)

is used for response variables in linear models, and

MASS::boxcox(object)

for a formula or fitted model object. So, because my data are the variable of a dataframe, the only function I found I could use is:

car::powerTransform(dataframe$variable, family="bcPower")

Is that correct? Or am I missing something?

The second question is about what to do after I obtain the

Estimated transformation parameters
dataframe$variable
0.6394806

Should I simply multiply the variable by this value? I did so:

aaa = 0.6394806
dataframe$variable2 = (dataframe$variable)*aaa

and then I run the shapiro-wilks test for normality, but again my data don't seem to follow a normal distribution:

shapiro.test(dataframe$variable2)
data:  dataframe$variable2
W = 0.97508, p-value < 2.2e-16
Linus Fernandes
  • 498
  • 5
  • 30
dede
  • 1,129
  • 5
  • 15
  • 35
  • 3
    I found to be a good documentation with clear R code and examples (and also for other transformations) the chapter [Transforming Data](https://rcompanion.org/handbook/I_12.html) in the handbook [Summary and Analysis of Extension Program Evaluation in R](https://rcompanion.org/handbook/) – Valentin_Ștefan Aug 11 '18 at 20:38
  • @Valentin very good explanation provided in the book mentioned. Many Thanks! – Rohit parihar May 02 '20 at 14:32

4 Answers4

40

Box and Cox (1964) suggested a family of transformations designed to reduce nonnormality of the errors in a linear model. In turns out that in doing this, it often reduces non-linearity as well.

Here is a nice summary of the original work and all the work that's been done since: http://www.ime.usp.br/~abe/lista/pdfm9cJKUmFZp.pdf

You will notice, however, that the log-likelihood function governing the selection of the lambda power transform is dependent on the residual sum of squares of an underlying model (no LaTeX on SO -- see the reference), so no transformation can be applied without a model.

A typical application is as follows:

library(MASS)

# generate some data
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 + rnorm(n)

# run a linear model
m <- lm(y ~ x)

# run the box-cox transformation
bc <- boxcox(y ~ x)

enter image description here

(lambda <- bc$x[which.max(bc$y)])
[1] 0.4242424

powerTransform <- function(y, lambda1, lambda2 = NULL, method = "boxcox") {

  boxcoxTrans <- function(x, lam1, lam2 = NULL) {

    # if we set lambda2 to zero, it becomes the one parameter transformation
    lam2 <- ifelse(is.null(lam2), 0, lam2)

    if (lam1 == 0L) {
      log(y + lam2)
    } else {
      (((y + lam2)^lam1) - 1) / lam1
    }
  }

  switch(method
         , boxcox = boxcoxTrans(y, lambda1, lambda2)
         , tukey = y^lambda1
  )
}


# re-run with transformation
mnew <- lm(powerTransform(y, lambda) ~ x)

# QQ-plot
op <- par(pty = "s", mfrow = c(1, 2))
qqnorm(m$residuals); qqline(m$residuals)
qqnorm(mnew$residuals); qqline(mnew$residuals)
par(op)

enter image description here

As you can see this is no magic bullet -- only some data can be effectively transformed (usually a lambda less than -2 or greater than 2 is a sign you should not be using the method). As with any statistical method, use with caution before implementing.

To use the two parameter Box-Cox transformation, use the geoR package to find the lambdas:

library("geoR")
bc2 <- boxcoxfit(x, y, lambda2 = TRUE)

lambda1 <- bc2$lambda[1]
lambda2 <- bc2$lambda[2]

EDITS: Conflation of Tukey and Box-Cox implementation as pointed out by @Yui-Shiuan fixed.

mlegge
  • 6,763
  • 3
  • 40
  • 67
  • 1
    You might point out that it's possible to get the same answer using `MASS` by using the model `lm(y ~ 1)` (in this case, `bc <- boxcox(variable ~ 1, data=dataframe)`). `powerTransform()` gives the "right" lambda, but the data have things going on that make it impossible to force normality using just Box-Cox. – Sam Dickson Nov 30 '15 at 16:17
  • Great answer! May I ask why you emphasized "errors" in the beginning of your answer? Is this because one is to transform the response variable (e.g. y in y ~ x_1 + x_2) and not the covariates (x_1 or x_2), or can one transform the covariates as well? – Helen Jan 09 '20 at 14:32
22

According to the Box-cox transformation formula in the paper Box,George E. P.; Cox,D.R.(1964). "An analysis of transformations", I think mlegge's post might need to be slightly edited.The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (Actually, y^(lambda) is called Tukey transformation, which is another distinct transformation formula.)
So, the code should be:

(trans <- bc$x[which.max(bc$y)])
[1] 0.4242424
# re-run with transformation
mnew <- lm(((y^trans-1)/trans) ~ x) # Instead of mnew <- lm(y^trans ~ x) 

More information

Please correct me if I misunderstood it.

  • Thank you for pointing this out (with excellent documentation!). I've updated my answer to try to address this – mlegge Jun 02 '17 at 14:55
5

If I want tranfer only the response variable y instead of a linear model with x specified, eg I wanna transfer/normalize a list of data, I can take 1 for x, then the object becomes a linear model:

library(MASS)
y = rf(500,30,30)
hist(y,breaks = 12)
result = boxcox(y~1, lambda = seq(-5,5,0.5))
mylambda = result$x[which.max(result$y)]
mylambda
y2 = (y^mylambda-1)/mylambda
hist(y2)
Hongzhen Tian
  • 51
  • 1
  • 2
  • had 4 different variables with all 4 histograms showing nonnormal distribution, this solution helped me to bring them to a normal distribution individually. – Arvind Reddy May 13 '21 at 17:05
1

Applying the BoxCox transformation to data, without the need of any underlying model, can be done currently using the package geoR. Specifically, you can use the function boxcoxfit() for finding the best parameter and then predict the transformed variables using the function BCtransform().

Mike.forest
  • 171
  • 1
  • 2