Does anyone know of a way to estimate multivariate Box-Cox transformations with survey data in R? I'm not aware of anything that takes into account strata and clusters (which is what my data have), but even something that takes into account probability weights would be great. I'm mostly worried that the distribution of one or more variables may change when probability weights are applied, so the estimated transformation may change radically. There may also be implications for the errors and the Box-Cox algorithm, etc., but this is beyond what is basically a theory-confirmation approach.

Updated question

The R function powerTransform works great, but I don't think there's anything yet for survey data. I thought Stata could handle this but as Nick pointed out this is not the case. The only Box-Cox transformation which handles sampling weights seems to be this.

Are you aware of any R function that allows you to apply both univariate and multivariate Box-Cox transformations to probability-weighted data?

I don't have any data but I was just wondering if anyone had found a solution to this. I know people appreciate when a specific example is given so...

Univariate Box-Cox: Results are returned for univariate Box-Cox when using lm and svyglm (survey package) objects.

library(survey)
data(api)
library(car)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
Sur<-svyglm(api00~mobility, design=dstrat)
NotSur<-lm(api00~mobility, data=apistrat)
powerTransform(Sur)
powerTransform(NotSur)

However, I don't think powerTransform handles the survey object correctly, because you get the same results as NotSur (and different from Sur) when you run

None<-svydesign(id=~1, weights=rep(1,nrow(apistrat)), data=apistrat)
Sur2<-svyglm(api00~mobility, design=None)
powerTransform(Sur2)

I'm even less sure how you would transform towards multivariate normality, as you'd have to use the actual data rather than a fitted survey object, e.g.

summary(powerTransform(cbind(api00,mobility)~1,apistrat))
Mercelo
  • Your statement about Stata is incorrect. The Stata command `boxcox` (not a function) does not support survey weights. See http://www.stata.com/help.cgi?boxcox which is public regardless of whether anyone has access to a copy of Stata. There is some support for weights in `boxcox`. I am puzzled that anyone wants to take the results of any Box-Cox procedure exactly. It's most appropriate as indicating a possible transformed scale or non-identity link function, which should always be consistent with what else you know about the data and the associated science. I can't comment on R. – Nick Cox Feb 14 '13 at 10:02
  • Nick, thank you for your comments and link. I've updated my question. – Mercelo Feb 14 '13 at 10:14
  • http://rinantipodes.blogspot.com/2011/12/nutrient-intake-data-mixed-methods.html – Anthony Damico Feb 16 '13 at 13:43

2 Answers

The link you have given appears to be to a user-defined function in SAS that runs within a data step. It should be possible to reprogram the method in R.

If you look at the suggested SAS method here, you'll see it uses proc transreg to estimate the power transformation required. That SAS proc does not accept survey weights. I am not sure what the weight option does in that proc; see here.

Update: I had a closer look at the first link you gave here. It appears that the weighting is being done in proc univariate with the weight option activated if the data contains weights. However, if you look at the detail for weight from here, you'll see that the weights are used to manipulate the variances. I'm not sure that you want to run with that assumption for your data.

Michelle
  • Michelle, many thanks for looking into this. (Unfortunately I can't upvote your answer because apparently I don't have enough reputation.) – Mercelo Feb 25 '13 at 18:53
  • I haven't found any good references for using weighting with nonlinear mixed methods. I worry about the effects on the within- and between-person variances as the survey weights tend to be very large relative to the sample size, so the number of "replicates" created by the weights is extremely large relative to the sample. I have tried and failed to find a reference on how to deal with weights appropriately, given the output distribution is also weighted. – Michelle Apr 02 '13 at 20:04

Using weights as in your linked SAS macro should give a good point estimate of the optimal transformation, but is likely to give an unreasonable interval estimate -- because the log likelihood ratio will not have the standard chi-squared distribution.
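For reference, one common way to write a weighted Box-Cox pseudo-log-likelihood is sketched below. The form is an assumption for illustration and has not been verified against the linked macro. Here $w_i$ are the sampling weights, $y_i^{(\lambda)}$ is the Box-Cox transform of $y_i$, and $\hat\beta_w(\lambda)$ is the weighted least squares fit on the transformed scale, with $\beta$ and $\sigma^2$ profiled out:

$$
\ell_w(\lambda) = -\frac{W}{2}\,\log \hat\sigma_w^2(\lambda) + (\lambda - 1)\sum_i w_i \log y_i,
\qquad
W = \sum_i w_i,
\qquad
\hat\sigma_w^2(\lambda) = \frac{1}{W}\sum_i w_i\left(y_i^{(\lambda)} - x_i^\top\hat\beta_w(\lambda)\right)^2 .
$$

Multiplying every weight by a constant $c$ multiplies $\ell_w(\lambda)$ by $c$, so the maximiser is unchanged but differences in log likelihood are scaled by $c$. That is one way to see why the point estimate is insensitive to how the weights are scaled while a likelihood-ratio interval is not.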

Scaling the weights to sum to the sample size would probably give a ballpark-correct interval, but a proper design-based analogue of the Box & Cox method would need the sampling distribution of the 'working' likelihood ratio (as used by the AIC and anova methods for survey::svyglm).
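A minimal sketch of the weight-scaling idea, assuming a weighted normal profile log-likelihood on a grid of lambda values. The helper `boxcox_wloglik` and the grid are hypothetical names and choices introduced here, not part of car or survey, and the interval uses the naive chi-squared cutoff, so it is at best the ballpark figure described above. It ignores strata and clusters.

```r
library(survey)   # used here only for the apistrat example data
data(api)

# Hypothetical helper: weighted Box-Cox profile (pseudo-)log-likelihood
# at a single lambda. Weights are rescaled to sum to the sample size.
boxcox_wloglik <- function(lambda, y, X, w) {
  w <- w * length(y) / sum(w)                       # scale weights to sum to n
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  fit <- lm.wfit(X, z, w)                           # weighted least squares
  sigma2 <- sum(w * fit$residuals^2) / sum(w)       # weighted residual variance
  -0.5 * sum(w) * log(sigma2) + (lambda - 1) * sum(w * log(y))
}

# Same model as in the question: api00 ~ mobility, weights pw
d <- apistrat[complete.cases(apistrat[, c("api00", "mobility", "pw")]), ]
X <- model.matrix(~ mobility, data = d)
y <- d$api00
w <- d$pw

lambdas <- seq(-2, 3, by = 0.05)                    # assumed search grid
ll <- sapply(lambdas, boxcox_wloglik, y = y, X = X, w = w)
lambda_hat <- lambdas[which.max(ll)]                # weighted point estimate

# Naive interval: treats the scaled pseudo-likelihood ratio as chi-squared(1),
# which is NOT design-based; read it as a rough guide only.
keep <- ll > max(ll) - qchisq(0.95, df = 1) / 2
lambda_interval <- range(lambdas[keep])
```

With all weights equal this reduces to the usual unweighted Box-Cox profile likelihood, which gives a quick sanity check against `powerTransform(NotSur)`.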

Thomas Lumley