I've been using some code to compute means for specific variable values (demographic breaks), but I now have data with a weight variable and need to calculate weighted means. Is it possible to adjust the function I already use for sample means so that it calculates the weighted mean instead? Here is some code to generate sample data:

df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

So far I've been using this code to get sample means

assign("Gender_Profile_1", 
       data.frame(sapply(subset(df, gender==1), FUN = function(x) mean(x, na.rm = TRUE))))

> Gender_Profile_1
           sapply.subset.df..gender....1...FUN...function.x..mean.x..na.rm...TRUE..
gender                                                                        1.000
agegroup                                                                      4.200
attitude_1                                                                    4.000
attitude_2                                                                    4.000
attitude_3                                                                    2.300
income                                                                    77274.700
weight                                                                        3.016

As you can see it generates Gender_Profile_1 with the means for all variables. In my attempt to calculate the weighted mean, I've tried to change the "FUN=" part to this

assign("Gender_Profile_1", 
       data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))

I get the following error message

 Error in weighted.mean.default(x, w = weight, na.rm = TRUE) : 
  'x' and 'w' must have the same length 

I've been trying all kinds of permutations of df$weight and df$x, but nothing seems to work. Any help or ideas would be great. Many thanks

H.Cheung

3 Answers

Base R

If you want to stick to base R, you can do the following:

# define func to return all weighted means
all_wmeans <- function(data_subset) {

  # which cols to summarise? all but gender and weight
  summ_cols <- setdiff(names(data_subset), c('gender', 'weight'))

  # for each col, calc weighted mean with weights from the 'weight' column
  result <- lapply(data_subset[, summ_cols], 
                   weighted.mean, w=data_subset$weight)

  # squeeze the resulting list back to a data.frame and return
  return(data.frame(result))
}

# now, split the df on gender, and apply the func to each chunk
lapply(split(df, df$gender), all_wmeans)

The result is a list of two data frames, one for each value of gender:

$`1`
  agegroup attitude_1 attitude_2 attitude_3   income
1 4.397546   4.027851   3.950597   1.962202 74985.25

$`2`
  agegroup attitude_1 attitude_2 attitude_3   income
1 4.092234   3.642666   3.676287   2.388872 64075.23
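If a single table is more convenient than a list, the per-gender results can be stacked into one data frame. The following is a minimal sketch, assuming the df from the question; by_gender and profile are illustrative names, and na.rm = TRUE is added only as an assumption about how missing values should be handled.

```r
# Sample data copied from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

# Same helper as above, with na.rm = TRUE passed through to weighted.mean
# (an assumption; it only drops missing values in x, not in the weights)
all_wmeans <- function(data_subset) {
  summ_cols <- setdiff(names(data_subset), c("gender", "weight"))
  data.frame(lapply(data_subset[, summ_cols], weighted.mean,
                    w = data_subset$weight, na.rm = TRUE))
}

# One one-row data frame per gender value
by_gender <- lapply(split(df, df$gender), all_wmeans)

# Stack the one-row data frames into a single data frame, one row per gender
profile <- do.call(rbind, by_gender)
profile
```

Each row of profile corresponds to one gender value, in the same order as the list returned by split.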

The fabulous data.table

If you don't mind using packages, dplyr and data.table are great packages that make this kind of stuff much simpler. Here's data.table:

# load library and create a data.table object
library(data.table)
my_dt <- data.table(df)

# now it's a one liner:
my_dt[, lapply(.SD, weighted.mean, w=.SD$weight), by=gender]

which returns:

   gender agegroup attitude_1 attitude_2 attitude_3   income   weight
1:      2 4.092234   3.642666   3.676287   2.388872 64075.23 4.099426
2:      1 4.397546   4.027851   3.950597   1.962202 74985.25 3.904483

The data.table code also groups the rows by gender and uses lapply to apply a function, plus an extra argument, to each Subset of Data (that is what .SD stands for). Note that weight is itself among the summarised columns here, which is why it appears in the output. Conceptually, it's exactly the same as the base R code, just more compact and fast.

arvi1000

You can do the whole lot at once like this:

sapply(1:2, function(y) 
  sapply(subset(df, df$gender == y), function(x) 
    weighted.mean(x, df$weight[df$gender == y])))
#>                    [,1]         [,2]
#> gender         1.000000     2.000000
#> agegroup       4.397546     4.092234
#> attitude_1     4.027851     3.642666
#> attitude_2     3.950597     3.676287
#> attitude_3     1.962202     2.388872
#> income     74985.247679 64075.232966
#> weight         3.904483     4.099426
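The call above hard-codes the gender values 1:2 and returns unlabelled columns. A possible variant, sketched here under the assumption that the df from the question is used, takes the groups from the data with split so the columns are labelled by gender code. The names wm and cols are illustrative, and gender and weight are dropped from the summary as a design choice.

```r
# Sample data copied from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

# Split by gender, then take the weighted mean of every remaining column,
# using the weights that belong to the same chunk of rows
wm <- sapply(split(df, df$gender), function(d) {
  cols <- setdiff(names(d), c("gender", "weight"))
  sapply(d[cols], weighted.mean, w = d$weight)
})

# Rows are variables, columns are gender codes
wm
```

Because the weights are taken from the same chunk d as the values, x and w always have the same length.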

Allan Cameron

I think the main problem with your code is that you refer to the weight column inside the sapply loop, but that column has not been subsetted the way df has, so it still holds all 20 values while each subsetted column has only 10. You can fix this by subsetting the weight column before the sapply call and then using the subsetted weights in the loop.

Using the code you posted:

weight <- subset(df, gender==1)[,"weight"]
# Same call as in the question, with the result stored under a new name
assign("Gender_Profile_2", 
       data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))

Here is another solution using apply that might be easier to implement:

# Unweighted column means for the gender == 1 rows, for comparison
apply(subset(df, gender==1), 2, FUN = function(x) mean(x, na.rm = TRUE))
# Get the weights of the rows that have gender == 1 (weight is column 7)
weight <- subset(df, gender==1)[,7]
# Apply the weighted mean function to every column except weight
apply(subset(df[,-7], gender==1), 2, FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))
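To make the length mismatch concrete and avoid relying on a global weight vector, the subsetting can be wrapped in a small helper. This is only a sketch under the assumption that the df from the question is used; weighted_profile, sub and the weighted_mean column name are hypothetical names introduced for illustration.

```r
# Sample data copied from the question
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
                 agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
                 attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
                 attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
                 attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
                 income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
                 weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))

sub <- subset(df, gender == 1)
nrow(sub)          # 10 rows in the subset
length(df$weight)  # 20 weights in the full data, hence the length error

# Hypothetical helper: weighted mean of every column except the weight column,
# with the weights taken from the same rows that are being summarised
weighted_profile <- function(data, weight_col = "weight") {
  w <- data[[weight_col]]
  value_cols <- setdiff(names(data), weight_col)
  sapply(data[value_cols], weighted.mean, w = w, na.rm = TRUE)
}

# Same layout as the original Gender_Profile_1, but with a readable column name
Gender_Profile_1 <- data.frame(weighted_mean = weighted_profile(sub))
Gender_Profile_1
```

Passing the already subsetted data frame into the helper keeps the values and their weights aligned, so the same function could be reused for other breaks such as agegroup.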