0

Introduction

I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on probability selection and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the sex (male or female) of and the employment status (full-time or not full-time) of the teacher. With the help of kind people at Statalist (see here), I have seemingly done this in Stata successfully. However, in trying to replicate the results in R, I come up with vastly different output.

Sample Data

#Variables
#school   : unique school id
#caseid   : unique teacher id
#oldwt    : the product of the design weight and the non-response adjustment
#gender   : male or female
#timecat  : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime   : a combined factor variable of school x timecat
#genderp  : the school's true population for gender
#fullp    : the school's true population for timecat

#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))

Raking Code

(See here and here for in-depth examples of using anesrake in R).

# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)

# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")

# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime") 

# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = F, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave

Comparison with Stata Output

The issue is that the output from R doesn't match up with the output with Stata (even if I set force1 = TRUE), and it seems that the Stata output is the one that is right, making me think my sloppy R code is wrong. Is that the case?

caseid    R      Stata 
  1     0.070    0.633
  2     0.152    1.367
  3     0.404    3.633
  4     0.187    1.683
  5     0.187    1.683
  6     0.143    1.146
  7     0.232    1.854
  8     0.173    1.382
  9     0.107    0.854
 10     0.173    1.382
 11     0.173    1.382
coip
  • 1,312
  • 16
  • 30
  • If you don't get a relevant answer here, I'd recommend www.statalist.org, where you're likely to find a bigger group of experts on survey methodology. – Roberto Ferrer Jun 29 '15 at 03:25
  • Cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1300504-raking-weights-on-nested-data-how-to-rake-by-group – Nick Cox Jun 29 '15 at 20:08
  • Per @RobertoFerrer 's suggestion, I posted on Statalist (linked above by @NickCox). Rather than re-posting that answer here, I attempted to replicate it in R, only to find discrepancies in the output (probably due to my poor R skills). So, I thought it would be better to edit the original question to focus on this mismatch instead. – coip Jul 04 '15 at 00:44

1 Answers1

1

The distribution of your targets in R should sum up one and represent the distribution in your population. Look at my example. I think that the force1 option will not compute the distribution you want at least each school has the same population weight. This is what force1 is doing:

targets[[1]]/sum(targets[[1]]) 1.Female 1.Male 2.Female 2.Male 0.278 0.222 0.125 0.375

Is that what you want?

sdaza
  • 1,032
  • 13
  • 29
  • "[Y]our targets in R should sum up one". I see. That's different than Stata, where they sum up to the actual population totals (e.g. the original sample data linked at Statalist above show n=9 for School 1 and n=8 for School 2, and the raked weights displayed above sum to 17 (9+8) for the Stata output. But you're right that the R output sums to 2 (1 + 1 for each school). I guess Stata's method (i.e. equaling the actual population totals) made more sense to me, which is why I thought I erred with my R code; I'm wondering what the implications of these different methods are. – coip Jul 11 '15 at 19:25
  • You can replicate what Stata produces in R, you just have to adjust the proportions based on the number of teachers per school. I think it is only an implementation difference, not a different method. – sdaza Jul 11 '15 at 20:04