Collapse data frame, by group, using lists of variables for weighted average AND sum

Question

I want to collapse the following data frame, using both summation and weighted averages, according to groups.

I have the following data frame

group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)

df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)

I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).

I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).

to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]

to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)

And my goal is to simultaneously collapse my data using eiter sum or weighted average, according to the type of variable (ie if its in percentage terms, I use weighted average).

Here is my best attempt:

 df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))

But, as you can see, it is not a weighted average

I have tried many different ways of using the weighted.mean fucntion, but have had no luck. Here is an example of one such attempt;

df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))

And the corresponding error:

Error in weighted.mean.default(to_be_weighted_2, weighting) : 
'x' and 'w' must have the same length

You should use `set.seed` before making random example data so we're all looking at the same thing and can verify potential answers. — Frank, May 04 '18 at 17:56
Anyway, if you're willing to use data.table, this sounds like what you're after `DT[, c(lapply(.SD[, ..to_be_weighted_2], weighted.mean, w = weighting), lapply(.SD[, ..to_be_summed_2], sum)), by=group_id]` — Frank, May 04 '18 at 18:00

score 0 · Accepted Answer · answered May 04 '18 at 18:35

Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.

The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.

library(tidyverse)

set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)

df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)

df_to_collapse %>%
    gather(key = var, value = value, -group_id, -weighting) %>%
    mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
    group_by(group_id, var) %>%
    summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
    ungroup() %>%
    spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#>   group_id var_1 var_2 var_percent_1 var_percent_2
#>      <dbl> <dbl> <dbl>         <dbl>         <dbl>
#> 1        1    26    31         0.269         0.483
#> 2        2    32    21         0.854         0.261
#> 3        3    29    49         0.461         0.262

Created on 2018-05-04 by the reprex package (v0.2.0).

Perfect! Many thanks for your help (and sorry for taking to long to mark as answered!) — SeánMcK, Jun 02 '18 at 19:54

Collapse data frame, by group, using lists of variables for weighted average AND sum

1 Answers1