2

I'm trying to get the weighted row mean of certain columns by group.

library(tidyverse)
set.seed(1)
df <- data.frame(group = rep(LETTERS[3:4], each = 10),
           x = runif(n = 10, min = 10, max = 15),
           y = runif(n = 10, min = 100, max = 150),
           z = runif(n = 10, min = 100, max = 150))
df
#    group        x        y        z
# 1      C 11.32754 110.2987 146.7353
# 2      C 11.86062 108.8278 110.6071
# 3      C 12.86427 134.3511 132.5837
# 4      C 14.54104 119.2052 106.2778
# 5      C 11.00841 138.4921 113.3610
# 6      C 14.49195 124.8850 119.3057
# 7      C 14.72338 135.8809 100.6695
# 8      C 13.30399 149.5953 119.1194
# 9      C 13.14557 119.0018 143.4845
# 10     C 10.30893 138.8723 117.0174
# 11     D 11.32754 110.2987 146.7353
# 12     D 11.86062 108.8278 110.6071
# 13     D 12.86427 134.3511 132.5837
# 14     D 14.54104 119.2052 106.2778
# 15     D 11.00841 138.4921 113.3610
# 16     D 14.49195 124.8850 119.3057
# 17     D 14.72338 135.8809 100.6695
# 18     D 13.30399 149.5953 119.1194
# 19     D 13.14557 119.0018 143.4845
# 20     D 10.30893 138.8723 117.0174

To get crude row mean of x, y, z, I can do:

df %>% 
  mutate(rmean = pmap_dbl(list(x, y, z), ~mean(c(...))))

But I want to weight them by these weights

dfweight <- data.frame(group = c("C", "C", "C",
                                 "D", "D", "D"),
                       cat = c("x", "y", "z", 
                               "x", "y", "z"),
                       weights = c(.2, .7, .1, 
                                   .4, .1, .5))
#   group cat weights
# 1     C   x     0.2
# 2     C   y     0.7
# 3     C   z     0.1
# 4     D   x     0.4
# 5     D   y     0.1
# 6     D   z     0.5

I thought I should extract the weights first:

dfweight_split <- lapply(split(dfweight, dfweight$group), function (x) x$weights)
dfweight_split
# $C
# [1] 0.2 0.7 0.1

# $D
# [1] 0.4 0.1 0.5

But I'm then unsure how to pmap/map over these?

df %>% 
  group_by(group) %>% 
  mutate(wmean = pmap_dbl(list(x, y, z), ~weight.mean(c(..., dfweight_split))))

#OR
df %>% 
  group_by(group) %>% 
  mutate(wmean = map2(list(x, y, z), dfweight_split, ~weight.mean(.x, .y)))

Happy to see base solutions too. A similar post is here.

thanks

user63230
  • 4,095
  • 21
  • 43

2 Answers2

1

I think it will be easier to perform this calculation if you reshape the data to long format. Then join the data with dfweight for each group and column name and find weighted mean for each row.

library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = x:z, names_to = 'cat') %>%
  left_join(dfweight, by = c('group', 'cat')) %>%
  group_by(group, row) %>%
  mutate(weight_mean = weighted.mean(value, weights)) %>%
  ungroup %>%
  select(-weights) %>%
  pivot_wider(names_from = cat, values_from = value) %>%
  select(-row)

#  group weight_mean     x     y     z
#   <chr>       <dbl> <dbl> <dbl> <dbl>
# 1 C            94.1  11.3  110.  147.
# 2 C            89.6  11.9  109.  111.
# 3 C           110.   12.9  134.  133.
# 4 C            97.0  14.5  119.  106.
# 5 C           110.   11.0  138.  113.
# 6 C           102.   14.5  125.  119.
# 7 C           108.   14.7  136.  101.
# 8 C           119.   13.3  150.  119.
# 9 C           100.   13.1  119.  143.
#10 C           111.   10.3  139.  117.
#11 D            88.9  11.3  110.  147.
#12 D            70.9  11.9  109.  111.
#13 D            84.9  12.9  134.  133.
#14 D            70.9  14.5  119.  106.
#15 D            74.9  11.0  138.  113.
#16 D            77.9  14.5  125.  119.
#17 D            69.8  14.7  136.  101.
#18 D            79.8  13.3  150.  119.
#19 D            88.9  13.1  119.  143.
#20 D            76.5  10.3  139.  117.

I got different random numbers with set.seed(1).

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • thanks, this works! `set.seed` output was a typo. I would prefer not to `pivot` if possible as its part of a larger dataset and means I would require more merging later, I wonder is there a `map` solution that may be shorter? – user63230 Feb 18 '21 at 12:51
  • 1
    It should be possible but 1) I don't think code will be any prettier 2) Based on the recent tests (https://stackoverflow.com/questions/66234422/remove-duplicate-values-across-a-few-columns-but-keep-rows/) that we did reshaping is faster than rowwise manipulation. – Ronak Shah Feb 18 '21 at 14:10
  • That is convincing but I'll leave it open for now to see if there are other workarounds – user63230 Feb 18 '21 at 14:26
1

If we want to use pmap, make sure that the 'dfweight' data columns are also in the same dataset. An option is to reshape to wide with pivot_wider, then do a join (right_join) and use pmap to loop over the rows, extract the column elements on the same order with the notation .. before the index, pass those as vector arguments in weighted.mean to create the column in mutate

library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
dfweight %>% 
   pivot_wider(names_from = cat, values_from = weights) %>% 
   rename_at(-1, ~ str_c(., '_weight')) %>%
   right_join(df) %>% 
   mutate(wmean = pmap_dbl(select(., -group), 
     ~ weighted.mean(c(..4, ..5, ..6), c(..1, ..2, ..3)))) %>% 
   select(-ends_with('weight'))
# A tibble: 20 x 5
#   group     x     y     z wmean
#   <chr> <dbl> <dbl> <dbl> <dbl>
# 1 C      11.3  110.  147.  94.1
# 2 C      11.9  109.  111.  89.6
# 3 C      12.9  134.  133. 110. 
# 4 C      14.5  119.  106.  97.0
# 5 C      11.0  138.  113. 110. 
# 6 C      14.5  125.  119. 102. 
# 7 C      14.7  136.  101. 108. 
# 8 C      13.3  150.  119. 119. 
# 9 C      13.1  119.  143. 100. 
#10 C      10.3  139.  117. 111. 
#11 D      11.3  110.  147.  88.9
#12 D      11.9  109.  111.  70.9
#13 D      12.9  134.  133.  84.9
#14 D      14.5  119.  106.  70.9
#15 D      11.0  138.  113.  74.9
#16 D      14.5  125.  119.  77.9
#17 D      14.7  136.  101.  69.8
#18 D      13.3  150.  119.  79.8
#19 D      13.1  119.  143.  88.9
#20 D      10.3  139.  117.  76.5
akrun
  • 874,273
  • 37
  • 540
  • 662