2

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable. Here is some sample data.

set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)

I've run this to get a regular (unweighted) crosstab:

table(df$sex, df$age)

To get weighted frequencies, I tried the Hmisc package (if you know a better package, let me know):

library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1

I'm not sure where I've gone wrong, but it doesn't run, so any help will be great. Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.

H.Cheung
  • **Just to add a note: the wgt variable can have decimals, so it will need an inbuilt weighting function. Thanks to anyone who responded using the rep function** – H.Cheung Oct 06 '20 at 15:23

4 Answers

2

A solution is to repeat the rows of the data.frame by weight and then table the result.

The following repeats the data.frame's rows (only relevant columns):

df[rep(row.names(df), df$wgt), 1:2]

And it can be used to get the contingency table.

table(df[rep(row.names(df), df$wgt), 1:2])
#       sex
#age     Female Male
#  0-15      56   76
#  16-29     73   99
#  30-44     60  106
#  45+       76   90
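
One caveat worth checking before using this on real survey weights: base R's `rep()` truncates non-integer `times` toward zero (see `?rep`), so fractional weights are rounded down rather than counted in full. A minimal sketch with made-up weights (the `toy` data frame is hypothetical, not the question's data):

```r
# Hypothetical toy data: two rows with fractional weights
toy <- data.frame(grp = c("a", "b"), wgt = c(2.9, 1.2))

# rep() truncates non-integer 'times' toward zero, so 2.9 -> 2 and 1.2 -> 1
expanded <- toy[rep(row.names(toy), toy$wgt), "grp"]
counts <- table(expanded)
counts   # "a" is counted twice, "b" once

# Summing the weights per group keeps the fractional part
totals <- tapply(toy$wgt, toy$grp, sum)
totals   # a = 2.9, b = 1.2
```

So the replication approach runs with decimal weights, but the resulting counts only match the weighted totals when the weights are whole numbers.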
Rui Barradas
  • Thanks, it's a neat solution, but normally the wgt variable would have decimals. I'm guessing the rep function wouldn't work unless wgt is an integer? – H.Cheung Oct 06 '20 at 15:09
  • This will work, as does mine, with non-integer weights – Chuck P Oct 06 '20 at 15:23
  • @Chuck P, it works for the test data but it's not working on a file with 30k+ rows of data. Cheers, I'll rewrite my question. I don't think `rep` is efficient – H.Cheung Oct 06 '20 at 15:42
  • It's too slow: it took about 6 minutes, so `rep` is not efficient here. My data has 30k+ rows and the wgt values are large (3000-7000). It looks like this physically replicates the sample, and since my data is weighted to the population, it would be replicating the rows up to 50m? I'd rather use another solution than scale the weights down and then scale the counts back up, but thanks again – H.Cheung Oct 06 '20 at 16:04
  • @H.Cheung If speed is an issue, @ekoam's [`GDAtools::wtable`](https://stackoverflow.com/a/64227791/8245406) solution is the fastest. – Rui Barradas Oct 06 '20 at 16:42
  • @Rui, the package wouldn't install. I've already put up another question and reworded it so it's clearer. I think your code with `rep` will be useful for me in other circumstances. Thank you for your help today. – H.Cheung Oct 06 '20 at 16:51
  • @H.Cheung The package name in the answer is wrong, it's with a lower case `t`: `GDAtools`. And it installs after this correction. – Rui Barradas Oct 06 '20 at 16:52
  • @H.Cheung In my tests `GDAtools::wtable` was 140 and 180 times faster than [Chuck P's](https://stackoverflow.com/a/64228313/8245406) and my solutions, respectively. The df had weights between 3000 and 5000. – Rui Barradas Oct 06 '20 at 17:13
2

Try this

GDAtools::wtable(df$sex, df$age, w = df$wgt)

Output

       0-15 16-29 30-44 45+ NA tot
Female   56    73    60  76  0 265
Male     76    99   106  90  0 371
NA        0     0     0   0  0   0
tot     132   172   166 166  0 636

Update

In case you do not want to install the whole package, here are two essential functions you need:

`wtable` and `dichotom`

Source them and you should be able to use wtable without any problem.

ekoam
  • Just a quick question: is it possible to add a third variable to this? I created a third variable, but `GDAtools::wtable(df$sex, df$age, df$VAR3, w = df$wgt)` doesn't work – H.Cheung Oct 06 '20 at 17:09
  • I don't think that works. You can only crosstab two variables each time. – ekoam Oct 06 '20 at 17:14
  • No worries about the third variable. I can't post another question, so I'll edit my other one. Thanks, you've answered the question here. An upvote for you. – H.Cheung Oct 06 '20 at 17:15
2

Base R, in stats, has xtabs for exactly this:

xtabs(wgt ~ age + sex, data=df) 
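
Because `xtabs` sums the left-hand side within each cell rather than replicating rows, it should also cope with decimal weights and with more than two classifying variables. A minimal sketch, assuming decimal weights drawn with `runif` and a hypothetical third variable `region` (neither is in the question's data; `df2` is a made-up name):

```r
set.seed(123)
df2 <- data.frame(
  sex    = sample(c("Male", "Female"), 100, replace = TRUE),
  age    = sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE),
  region = sample(c("North", "South"), 100, replace = TRUE),  # hypothetical third variable
  wgt    = runif(100, min = 0.5, max = 20)                    # decimal weights (assumption)
)

# Two-way weighted table: each cell is the sum of wgt for that sex/age combination
tab2 <- xtabs(wgt ~ sex + age, data = df2)
tab2

# Three-way weighted table; ftable() flattens it for printing
tab3 <- xtabs(wgt ~ sex + age + region, data = df2)
ftable(tab3)

# Weighted cell proportions from the two-way table
prop.table(tab2)
```

No rows are copied here, so the cost depends on the number of rows and not on the size of the weights.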
dsz
0

A tidyverse solution using your data and the same `set.seed`. `uncount` is the equivalent of @Rui's `rep` over the weights.

library(dplyr)
library(tidyr)

df %>%
   uncount(wgt) %>%   # repeat each row wgt times; uncount also drops the wgt column
   table()
#>        sex
#> age     Female Male
#>   0-15      56   76
#>   16-29     73   99
#>   30-44     60  106
#>   45+       76   90
Chuck P
  • If that is the case, would it work if the wgt values are not integers? Sorry, I should have created a random wgt with decimal places – H.Cheung Oct 06 '20 at 15:11
  • Yes. I used `wgt <- runif(100, min = .5, max = 20)` to test and it did just fine – Chuck P Oct 06 '20 at 15:21