
I've been working on some survey data using the survey package. I read the available documentation on post-stratification and calibration, but I got stuck trying to calibrate the sampling weights to a known population total of a variable, rather than to the population size.

To make myself clear, I prepared an example. Let's say I have income information for a sample stratified by sex, which lets me create the svydesign object:

data = data.frame(id = c(1:5),
              sex = c("F","F","F","C","C"),
              income = c(100,150,75,200,100),
              sw = c(2,2,3,3,3))

dis = svydesign(ids = ~id,
                strata = ~ sex,
                weights = ~sw,
                data = data)

Then I can calculate the total income by sex with:

    svyby(~income,~ sex,dis,svytotal)

  gender income        se
F      F    725  90.13878
M      M    900 300.00000

However, I don't know how many males or females are in the population but I do know total income by sex:

  gender income
     F    800
     M    800

I haven't been able to find a way of using the calibrate or postStratify functions to get these estimated totals by sex right with se = 0 (i.e. calibrating or post-stratifying a survey design on a total other than the population count in each group).

I know I could adjust the sampling weights by multiplying them by the ratio calibration factor (the known population total divided by the estimated total, by sex). This approach has some limitations, as stated here: I would get the point estimates right but not the standard errors.
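
For reference, a minimal base R sketch of that ratio adjustment. The names known, est, ratio and sw_ratio are only illustrative, and the second group is labelled "M" here so that it lines up with the table of known totals:

```r
# Toy data from the example (second group labelled "M" to match the totals)
data <- data.frame(id = 1:5,
                   sex = c("F", "F", "F", "M", "M"),
                   income = c(100, 150, 75, 200, 100),
                   sw = c(2, 2, 3, 3, 3))

# Known income totals by sex
known <- c(F = 800, M = 800)

# Weighted income totals estimated from the sample, by sex
est <- sapply(split(data$sw * data$income, data$sex), sum)

# Ratio factor per group: known total / estimated total
ratio <- known[names(est)] / est

# Rescale each unit's weight by the factor of its group
data$sw_ratio <- data$sw * ratio[data$sex]

# Weighted totals with the rescaled weights now match the known totals
new_tot <- sapply(split(data$sw_ratio * data$income, data$sex), sum)
print(new_tot)
```

The rescaled weights reproduce the known totals, but a design built from them treats the weights as fixed, which is where the standard error problem comes from.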

Thanks in advance for reading this! Any suggestions would be appreciated. :)

christk
  • why do you estimate income from a sample if you have information about income in the population? – Yuriy Saraykin Feb 10 '22 at 08:03
  • It was just for the example. I don’t want just to estimate income. There are a bunch of variables in the full survey, but the total income is the only auxiliary information I have available for calibration. – christk Feb 10 '22 at 13:42
  • sounds like you have a prior and could use a bayesian approach? – mnist Feb 15 '22 at 21:25

2 Answers


I think you can use calibration for this, but remember that there is a model doing the work behind the scenes. As with any model in R, you have to specify it with a formula object. To do that, I'd do this:

library( survey )

data = data.frame(id = c(1:5),
                  sex = c("F","F","F","M","M"),
                  income = c(100,150,75,200,100),
                  sw = c(2,2,3,3,3))

dis = svydesign(ids = ~id,
                strata = ~ sex,
                weights = ~sw,
                data = data)

(I changed the "C" to "M" in the sex variable to make sense with the totals "labels".) At this point, run the calibration:

dis.cal <- calibrate(dis, ~-1+sex:income , c( `sexM:income`=800 , `sexF:income` = 800 ))
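
The names in the population vector are meant to line up with the columns of the model matrix that the formula generates. A quick way to inspect those columns, sketched in base R (the object name X is just for illustration):

```r
# Same toy data as above
data <- data.frame(id = 1:5,
                   sex = c("F", "F", "F", "M", "M"),
                   income = c(100, 150, 75, 200, 100),
                   sw = c(2, 2, 3, 3, 3))

# Model matrix for the calibration formula: one income column per sex,
# no intercept, so each column total is a total income by sex
X <- model.matrix(~ -1 + sex:income, data = data)
print(colnames(X))

# Weighted column totals: the quantities the known totals are matched to
print(colSums(X * data$sw))
```

Dropping the intercept with -1 is what keeps the population size out of the calibration constraints, so only the income totals by sex are needed.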

Now, let's compare the results. For the original survey design object, we had:

> svyby(~income,~ sex,dis, svytotal)
  sex income        se
F   F    725  90.13878
M   M    900 300.00000

Now, after calibration, we have:

> svyby(~income,~ sex,dis.cal , svytotal)
  sex income           se
F   F    800 5.413807e-14
M   M    800 1.180346e-13

The SEs are practically zero, as we would expect.
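
To see roughly where this comes from, here is a hand-rolled sketch of linear calibration in base R. It assumes the default linear calibration function with unit variance weights, so treat it as an illustration of the idea and not as the package's actual code; the names X, pop, lambda, g and w_cal are made up for this example.

```r
data <- data.frame(id = 1:5,
                   sex = c("F", "F", "F", "M", "M"),
                   income = c(100, 150, 75, 200, 100),
                   sw = c(2, 2, 3, 3, 3))

X   <- model.matrix(~ -1 + sex:income, data = data)
pop <- c("sexF:income" = 800, "sexM:income" = 800)
w   <- data$sw

# Linear calibration: g = 1 + X %*% lambda, where lambda solves
# (t(X) W X) lambda = (known totals - estimated totals)
Tx     <- colSums(X * w)
lambda <- solve(crossprod(X, X * w), pop[colnames(X)] - Tx)
g      <- 1 + drop(X %*% lambda)
w_cal  <- w * g

# Calibrated weights reproduce the known income totals by sex
print(colSums(X * w_cal))

# Income is perfectly predicted by the calibration variables, so the
# regression residuals that drive the calibrated variance are all zero
res <- resid(lm(income ~ -1 + sex:income, data = data, weights = w))
print(max(abs(res)))
```

Since the variance of a calibrated total is driven by those residuals, a variable that is itself a calibration variable ends up with an SE of zero up to rounding error.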

That said, I'd watch out for the actual scenario in which you're applying this technique. For instance, measurement errors, small samples and other issues might be problematic. You can even lose some efficiency if the study variable is not correlated with the auxiliary variables. I suggest reading Deville and Särndal (1992), the calibration chapter in Lumley's (2011) book and Nascimento Silva's working paper.

Guilherme Jacob
  • This is exactly what I was looking for! Thank you very much. The real scenario is a little bit trickier: we have a survey where areas are sampled so we know the total of cultivated areas rather than the total of farmers. However I will consider those references for the document, as you recommended. – christk Feb 19 '22 at 16:53

Here is a workaround.

All your data is stored in dis$variables; from there you can export it and make your calculations. I hope this can inspire better solutions.

library(dplyr)    
dis$variables %>%
      group_by(sex) %>% 
      summarize(sw_sum = sum(sw),
                n_sex = n()) %>%
      ungroup() %>% 
      mutate(total_sex = sw_sum*n_sex) %>% 
      select(sex, total_sex)


Ruam Pimentel
  • thanks for taking the time to answer this but i'm not sure it's related to calibration of a complex-sample survey.. – Anthony Damico Feb 17 '22 at 09:58
  • I see. I had the impression I didn't understand the question very well. Anyway, as you can see, I'm new to SO, so here is a quick question: when my answer is not related to the question, should I leave it there or delete it? – Ruam Pimentel Feb 17 '22 at 14:43