1

I am trying to filter my data frame to leave the highest value of one variable (Question), for each combination of two others (Person and Test).

My data.frame looks something like this:

df <- data.frame(Person=c("Person1","Person1","Person1","Person2","Person2","Person2","Person3","Person3","Person3"),
             Test=c(rep("Test1",9)),
             Question=c("1","2","3","1","2","3","1","2","3"))

Except that there are multiple tests i.e. Test2, Test3 etc.

I want to filter down to show the last Question in each Test for every Person. There are a different number of Questions in each Test.

Using the response to this question: dplyr filter: Get rows with minimum of variable, but only the first if multiple minima, I managed to get some of the way with:

library(dplyr)
df.grouped <- group_by(df.orginial, Person, Test)
df.lastquestion <- filter(df.grouped, Question == max(Question))

Unfortunately, it leaves me with the highest Question number that each Person answered across all Tests. Whereas, I would like the highest Question number that each Person answered on each Test.

Thanks

Community
  • 1
  • 1
zz9321
  • 73
  • 1
  • 5
  • 1
    `df %>% group_by(Person, Test) %>% slice(which.max(Question))` I think your syntax is correct. – akrun Jul 31 '15 at 14:55
  • 1
    Same logic, `df %>% mutate(Question=as.numeric(Question)) %>% group_by(Person, Test) %>% filter(Question==max(Question))` – dimitris_ps Jul 31 '15 at 14:58
  • 1
    @dimitris_ps Nice catch with `as.numeric`. I didn't notice that. – akrun Jul 31 '15 at 15:01
  • `library(data.table); setDT(df)[order(Question),Question[.N],by=.(Person,Test)]`. Don't need `order(Question)` if you're sure it's sorted by question already. Also don't need to cast as `numeric` since `order` will put it in alphanumeric order. – MichaelChirico Jul 31 '15 at 15:18
  • @MichaelChirico I am not sure whether the `order` does that. For example, `v1 <- paste(1:15); setDT(list(v1))[order(v1)]` – akrun Jul 31 '15 at 15:27
  • hmm. If there's more than 10 questions, use `order(sprintf("%02s",Question))`. Or change 2 to match the max # digits. – MichaelChirico Jul 31 '15 at 15:29
  • Seriously guys; if it's worth answering in the comments without close votes or requests for clarification or more detail: **post a proper answer!!** most of these comments are clearly long enough to be Answers in their own right without extra exposition or padding. – Gavin Simpson Jul 31 '15 at 15:40
  • `which.max` is fine if there is only maximum (as is the case in this example set), but be aware of it's behaviour when there are two or more values that take the maximum value. – Gavin Simpson Jul 31 '15 at 15:43
  • @GavinSimpson i would be amazed if this is not a duplicate. so i'm not posting an answer. – MichaelChirico Jul 31 '15 at 15:43
  • @MichaelChirico I'm sure it is, but the mechanism there is to close as a duplicate thus directing people to the canonical Q&A, possibly merging any answers *if* that wouldn't make the merged Answer clash with the detail in the Question to which they were merged. Duplicates are OK in many cases as they help provide exposure to a solution from multiple viewpoints & keywords, which helps when searching for solutions. Answering in comments does little to rectify the duplicate issue. – Gavin Simpson Jul 31 '15 at 15:46
  • @GavinSimpson then the solution is just close with another dupe, no? – David Arenburg Aug 04 '15 at 11:34

2 Answers2

2

Whilst there'll be a plethora of dplyr, plyr, and data.table options proffered, here's a good ol' fashioned base-R version, using a somewhat expanded (and vastly simplified) version of your example data

df <- data.frame(Person = rep(paste0("Person", 1:3), each = 3, times = 2),
                 Test = rep(paste0("Test", 1:4), each = 9),
                 Question = as.character(rep(1:3, times = 3 * 2)))

You could do this inline, but an explicit wrapper allows me to focus on two aspects of this question

wrapper <- function(x) {
  with(x, x[Question == max(Question), ])
}

You could use which.max(Question) here, but that would select the first of the maxima if more than one value in Question took the same value as the maximum value.

Now we want to split the data and then apply wrapper() to each element. The other packages mentioned above provide more consistent and in some cases faster implementations of this but base-R is often competitive:

ll <- lapply(with(df, split(df, list(Person, Test))), wrapper)

Now just bind everything together:

newdf <- do.call("rbind", c(ll, make.row.names = FALSE))
head(newdf)

Which returns:

> head(newdf)
    Person  Test Question
1  Person1 Test1        3
2  Person2 Test1        3
3  Person3 Test1        3
4  Person1 Test2        3
5  Person2 Test2        3
6  Person3 Test2        3

The whole thing would be:

wrapper <- function(x) {
  with(x, x[Question == max(Question), ])
}
ll <- lapply(with(df, split(df, list(Person, Test))), wrapper)
newdf <- do.call("rbind", c(ll, make.row.names = FALSE))
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
1

Use ave:

df[df$Question == ave(as.numeric(df$Question),list(df$Person,df$Test),FUN = max), ]
ARobertson
  • 2,857
  • 18
  • 24