0

I have the following table:

df = structure(list(test_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4), person = structure(c(1L, 
2L, 3L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 3L), .Label = c("a", "b", 
"c", "d"), class = "factor"), points = c(1, 5, 2, 6, 5, 3, 4, 
5, 6, 2, 1)), .Names = c("test_id", "person", "points"), row.names = c(NA, 
-11L), class = "data.frame")

I want to determine a couple of scenarios and need your help on scenarios 2 & 3:

  1. Which single person tallies the most points across all the tests, e.g. using dplyr: df %>% group_by(person) %>% summarize(most_points = sum(points)) %>% top_n(1,most_points)
  2. Which two people combine to tally the most points across all tests (taking the maximum point value between the two people for each test)

    Expected output: person a & person b would be the best two person combination, with the sum of their points being 17 (b wins tests 1,2 & 3 and a wins test 4).

  3. Which three people combine to tally the most points across all tests (taking the maximum point value between the three people for each test)

    Expected output: person a, person b & person c would be the best three person combination, with the sum of their points being 19 (person b wins tests 1 & 2, person d wins test 3, person a wins test 4)

This is a very simplified table and in reality I have hundreds of thousands of rows I'll be analyzing with many more people and test_id's. Note that not everyone has a point value for each test. I'm not interested in a weighted average, simply maximum points they (or a combination of two/three of them) have accumulated.

John
  • 802
  • 2
  • 9
  • 19
  • Could you provide an example of your expected output for points 2. and 3. ? – Lamia May 24 '17 at 01:53
  • Sorry @Lamia, should have done this from the get go. 2.) Person a & person b would be the best two person combination, with the sum of their points being 17 (b wins tests 1,2 & 3 and a wins test 4). 3.) Person a, person b & person c would be the best three person combination, with the sum of their points being 19 (person b wins tests 1 & 2, person d wins test 3, person a wins test 4). – John May 24 '17 at 03:43
  • In the case of 2 people, b wins tests 1 (with 5 points) and 2 (with 6 points), but d wins test 3 (with 6 points) and a wins test 4 (with 2 points). Therefore shouldn't the best 2 winners combination be "b and d" with 17 points? And in the case of 3 winners "a,b and d" with 19 points? – Lamia May 24 '17 at 13:40
  • In the case of 2 people, there's actually two scenarios that come out to 17. B wins tests 1 (with 5 points), 2 (with 6 points), and 3 (with 4 points), and A wins test 4 (with 2 points), Or B wins tests 1 (with 5 points) and 2 (with 6 points), and D wins test 3 (with 6 points). – John May 24 '17 at 14:55
  • @Lamia: the optimal combination will not always include the winner of each test. Instead, I'm looking at the optimal combination of 2 and 3 people that will maximize the overall sum of points. – John May 24 '17 at 16:17
  • I think I finally understood what you were asking. I'll edit my answer soon.. – Lamia May 24 '17 at 16:21

2 Answers2

1

It took a while to get here, but here's what you could do. We'll first change the format of the data:

library(tidyr)
dfs=spread(df,person,points)
dfs[is.na(dfs)]=0
pers=unique(df$person)

This returns:

  test_id a b c d
1       1 1 5 2 0
2       2 0 6 5 0
3       3 3 4 5 6
4       4 2 0 1 0

Then, we'll look for all the combinations of two people using combn, and determine the maximum value between the two for each test, and sum it across all tests. With that we can identify the pairs that have the highest sum:

cc2=combn(1:length(pers),2)
values2 = sapply(1:ncol(cc2),function(i) sum(apply(dfs[,cc2[,i]+1],1,function(x) max(x))))
names(values2) = apply(cc2,2,function(x) paste(pers[x],collapse="-"))
values2=values2[values2==max(values2)]

This returns:

a-b b-c b-d 
 17  17  17 

For combinations of 3 people, we do the same thing:

cc3=combn(1:length(pers),3)
values3 = sapply(1:ncol(cc3),function(i) sum(apply(dfs[,cc3[,i]+1],1,function(x) max(x))))
names(values3) = apply(cc3,2,function(x) paste(pers[x],collapse="-"))
values3=values3[values3==max(values3)]

This returns:

a-b-d 
   19 
Lamia
  • 3,845
  • 1
  • 12
  • 19
  • Hi @Lamia, this is exactly what I'm looking for, thanks! Since you've responded, I've had to add another element to the analysis. I need to rerun these combinations within groups. Take the example in the below comment, where I've added a "subject" column. I need the max sums within these groups using the same methodology you've outlined already. – John Jun 05 '17 at 14:53
  • structure(list(subject = c("math", "math", "science", "history", "history", "science", "english", "english", "math", "english", "history"), test_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4), person = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 3L), .Label = c("a", "b", "c", "d"), class = "factor"), points = c(1, 5, 2, 6, 5, 3, 4, 5, 6, 2, 1)), class = "data.frame", row.names = c(NA, -11L), .Names = c("subject", "test_id", "person", "points")) – John Jun 05 '17 at 14:54
  • If you want to do the same analysis for each subject taken separately, you could create a dataframe per subject and do it. Otherwise, split your dataframe into a list `split(df,df$subject)`, and then adapt the analysis to a list format using `lapply`. – Lamia Jun 05 '17 at 17:48
0

Top two people

df %>% 
  group_by(person) %>% 
  summarize(most_points = sum(points)) %>% 
  arrange(desc(most_points)) %>%
  top_n(2, most_points)

# A tibble: 2 × 2
  person most_points
  <fctr>       <dbl>
1      b          20
2      c           8

Top three people (really four with a tie)

df %>% 
  group_by(person) %>% 
  summarize(most_points = sum(points)) %>% 
  arrange(desc(most_points)) %>%
  top_n(3, most_points)

# A tibble: 4 × 2
  person most_points
  <fctr>       <dbl>
1      b          20
2      c           8
3      a           6
4      d           6
manotheshark
  • 4,297
  • 17
  • 30
  • Thanks @manotheshark. Instead, I'd like to find the optimal combination of people that will yield the highest point total. E.g. if b & c were in fact the best combination, then there would just be one output value: the sum of the higher point total between b & c for each item. Hope that is more clear. – John May 23 '17 at 14:11