group_by() summarise() and weights percentages - R

Question

Let's suppose that a company has 3 Bosses and 20 Employees, where each Employee has done n_Projects with an overall Performance in percentage:

> df <- data.frame(Boss = sample(1:3, 20, replace=TRUE),
                  Employee = sample(1:20,20),
                  n_Projects = sample(50:100, 20, replace=TRUE),
                  Performance = round(sample(1:100,20,replace=TRUE)/100,2),
                  stringsAsFactors = FALSE)

> df
   Boss Employee n_Projects Performance
1     3        8         79        0.57
2     1        3         59        0.18
3     1       11         76        0.43
4     2        5         85        0.12
5     2        2         75        0.10
6     2        9         66        0.60
7     2       19         85        0.36
8     1       20         79        0.65
9     2       17         79        0.90
10    3       14         77        0.41
11    1        1         78        0.97
12    1        7         72        0.52
13    2        6         62        0.69
14    2       10         53        0.97
15    3       16         91        0.94
16    3        4         98        0.63
17    1       18         63        0.95
18    2       15         90        0.33
19    1       12         80        0.48
20    1       13         97        0.07

The CEO asks me to compute the quality of the work for each boss. However, he asks for a specific calculation: Each Performance value has to have a weight equal to the n_Project value over the total n_Project for that boss.

For example, for Boss 1 we have a total of 604 n_Projects, where the project 1 has a Performance weight of 0,13 (78/604 * 0,97 = 0,13), project 3 a Performance weight of 0,1 (59/604 * 0,18 = 0,02), and so on. The sum of these Performance weights are the Boss performance, that for Boss 1 is 0,52. So, the final output should be like this:

 Boss  total_Projects Performance
    1             604        0.52
    2             340        0.18  #the values for boss 2 are invented
    3             230        0.43  #the values for boss 3 are invented

However, I'm still struggling with this:

df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))

In addition to this problem, can you give me any feedback about this problem (my code, specifically) or any recommendation to improve data-manipulations skills? (you can see in my profile that I have asked a lot of questions like this, but still I'm not able to solve them on my own)

As soon as you don't need grouping anymore, adding an `ungroup()` statement may avoid certain obscure error message when your pipe goes on — Romain, Feb 27 '19 at 20:16

akrun · Answer 1 · 2019-02-27T19:27:18.290

We can get the sum of product of `n_Projects' and 'Performance' and divide by the 'total_projects'

library(dplyr)
df %>% 
    group_by(Boss) %>% 
    summarise(total_projects = sum(n_Projects), 
              Weight_Project = sum(n_Projects * Performance)/total_projects)
    # or
    # Weight_Project = n_Projects %*% Performance/total_projects)
# A tibble: 3 x 3
#   Boss total_projects Weight_Project
#  <int>          <int>          <dbl>
#1     1            604          0.518
#2     2            595          0.475
#3     3            345          0.649

Romain · Accepted Answer · 2019-02-27T20:15:29.553

Adding some more details about what you did and @akrun's answer :

You must have received the following error message :

df %>%
  group_by(Boss) %>%
  summarise(total_Projects = sum(n_Projects),
            Weight_Project = n_Projects/sum(total_Projects))

## Error in summarise_impl(.data, dots) : 
##  Column `Weight_Project` must be length 1 (a summary value), not 7

This tells you that the calculus you made for Weight_Project does not yield a unique value for each Boss, but 7. summarise is there to summarise several values into one (by means, sums, etc.). Here you just divide each value of n_Projects by sum(total_Projects), but you don't summarise it into a single value.

Assuming that what you had in mind was first calculating the weight for each performance, then combining it with the performance mark to yield the weighted mean performance, you can proceed in two steps :

df %>% 
  group_by(Boss) %>% 
  mutate(Weight_Performance = n_Projects / sum(n_Projects)) %>% 
  summarise(weighted_mean_performance = sum(Weight_Performance * Performance))

The mutate statement preserves the number of total rows in df, but sum(n_Projects) is calculated for each Boss value thanks to group_by. Once, for each row, you have a project weight (which depends on the boss), you can calculate the weighted mean — which is a mean thus a summary value — with summarise.

A more compact way that still lets appear the weighted calculus would be :

df %>% 
  group_by(Boss) %>% 
  summarise(weighted_mean_performance = sum((n_Projects / sum(n_Projects)) * Performance))

# Reordering to minimise parenthesis, which is @akrun's answer
df %>% 
  group_by(Boss) %>% 
  summarise(weighted_mean_performance = sum(n_Projects * Performance) / sum(n_Projects))

Indeed, that was exactly the error I had. And thanks for the feedback and explanation, it's super clear!. — Chris, Feb 28 '19 at 12:32

group_by() summarise() and weights percentages - R

2 Answers2