2

Let's say I have a response variable which is not normally distributed and an explanatory variable. Let's create these two variables first (coded in R):

set.seed(12)
resp = (rnorm(120)+20)^3.79
expl = rep(c(1,2,3,4),30)

I run a linear model and I realize that the residuals are not normally distributed. (I know running a Shapiro might not be enough to justify that the residuals are not normally distributed but it is not the point of my question)

m1=lm(resp~expl)
shapiro.test(residuals(m1))
0.01794

Therefore I want to transform my explanatory variable (looking for a transformation with a Box-Cox for example).

m2=lm(resp^(1/3.79)~expl)
shapiro.test(residuals(m2))
0.4945

Ok, now my residuals are normally distributed it is fine! I now want to make a graphical representation of my data and my model. But I do not want to plot my explanatory variable in the transformed form because I would lose lots of its intuitive meaning. Therefore I do:

plot(x=expl,y=resp)

What if I now want to add the model? I could do this

abline(m2) # m2 is the model with transformed variable

but of course the line does not fit the data represented. I could do this:

abline(m1) # m1 is the model with the original variable.

but it is not the model I ran for the statistics! How can I re-transform the line predicted by m2 so that it fits the data?

Remi.b
  • 17,389
  • 28
  • 87
  • 168

2 Answers2

2
plotexpl <- seq(1,4,length.out=10)
predresp <- predict(m2,newdata=list(expl=plotexpl))

lines(plotexpl, predresp^(3.79))

I won't discuss the statistical issues here (e.g. a non-significant test does not mean that H0 is true and your model is not better than the mean).

Roland
  • 127,288
  • 10
  • 191
  • 288
  • Thank you! I don't understand at 100% how it works though. Does it work as well if `expl` is non-linear.? example: `expl = c(c(1,2,32,40),30)`. I guess it should. Does it change anything to the predicted line if I use 3 points or 100 points (in other words, if `length.out` equals 3 or 100)? – Remi.b Aug 29 '13 at 09:45
  • This code uses `m2` to get predicted values for new `expl` values and then transforms them back to the original scale. The higher the number of predicted data points, the smoother the line looks (as `lines` interpolates between points). Whether the distribution of `expl` is uniform or not (I guess that's what you mean with "non-linear") doesn't matter for plotting. – Roland Aug 29 '13 at 11:55
0

Since you've mentioned that the transformation might base on Box-Cox formula, I would like to point out a issue you might want to consider.

According to the Box-cox transformation formula in the paper Box,George E. P.; Cox,D.R.(1964). "An analysis of transformations", your transformation implementation (in case it is a Box-Cox one) might need to be slightly edited.The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (Actually, y^(lambda) is called Tukey transformation, which is another distinct transformation formula.)
So, the code should be:

lambda=3.79
m2=lm(resp^((lambda-1)/lambda)~expl)
shapiro.test(residuals(m2))

More information

Please correct me if I misunderstood your implementation.