R Selecting highest count cells conditional on a second column

Question

Apologies, if this is a duplicate please let me know, I'll gladly delete.

I am attempting to select the four highest values for different values of another column.

Dataset:

For example, I would like to select four highest values when A=1 (9,7,7,6) then when A=2 (9,8,5,5) and so on...

I have looked a various answers on 'selecting highest values' but was struggling to find an example conditioning on another column.

Many thanks

score 5 · Accepted Answer · answered Aug 01 '17 at 14:56

You could split COUNT by A and then get the top 4 values for each sub-group

lapply(split(df$COUNT, df$A), function(x) head(sort(x, decreasing = TRUE), 4))
#$`1`
#[1] 9 7 7 6

#$`2`
#[1] 9 8 5 5

#$`3`
#[1] 9 8 7 6

#$`4`
#[1] 9 8 3 1

Using sapply could give a more friendly output

sapply(split(df$COUNT, df$A), function(x) head(sort(x, decreasing = TRUE), 4))
#     1 2 3 4
#[1,] 9 9 9 9
#[2,] 7 8 8 8
#[3,] 7 5 7 3
#[4,] 6 5 6 1

score 3 · Answer 2 · answered Aug 01 '17 at 15:03

You can use aggregate :

aggr <- aggregate(COUNT~A,data=DF,function(x)head(sort(x,decreasing=TRUE),4))
> aggr
  A COUNT.1 COUNT.2 COUNT.3 COUNT.4
1 1       9       7       7       6
2 2       9       8       5       5
3 3       9       8       7       6
4 4       9       8       3       1

score 1 · Answer 3 · answered Aug 01 '17 at 15:05

1

library(dplyr)
new1 <- df %>%
  group_by(A)%>%
  summarise(y = tail(sort(COUNT),4))

answered Aug 01 '17 at 15:05

nak5120

4,089
4
35
94

Your code leads to this error message: Error in summarise_impl(.data, dots) : Column `y` must be length 1 (a summary value), not 4 – www Aug 01 '17 at 15:16

score 1 · Answer 4 · answered Aug 01 '17 at 15:09

We can first arrange the data frame by A and descending of COUNT, and the slice the data frame to keep the top 4 rows of each group in A. dt2 is the output of this.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  arrange(A, desc(COUNT)) %>%
  group_by(A) %>%
  slice(1:4)

After that, we can also change the data frame from long format to wide format. dt3 is the output of this.

dt3 <- dt2 %>%
  mutate(Count_Num = paste0("Count", 1:n())) %>%
  spread(Count_Num, COUNT)

lmo · Answer 5 · 2017-08-01T15:28:32.967

Another base R option is tapply with order and [

tapply(dat$COUNT, dat$A, function(x) x[order(-x)][1:4])
$`1`
[1] 9 7 7 6

$`2`
[1] 9 8 5 5

$`3`
[1] 9 8 7 6

$`4`
[1] 9 8 3 1

to return a named list.

with by

by(dt$COUNT, dt$A, function(x) x[order(-x)][1:4])

you get nicer looking output on the screen.

You can both of these into a matrix with the do.call rbind paradigm. For example,

do.call(rbind, by(dt$COUNT, dt$A, function(x) x[order(-x)][1:4]))
  [,1] [,2] [,3] [,4]
1    9    7    7    6
2    9    8    5    5
3    9    8    7    6
4    9    8    3    1

With data.table, you could do

library(data.table)
setDT(dat)[order(-COUNT), head(COUNT, 4), by="A"]
    A V1
 1: 1  9
 2: 1  7
 3: 1  7
 4: 1  6
 5: 2  9
 6: 2  8
 7: 2  5
 8: 2  5
 9: 3  9
10: 3  8
11: 3  7
12: 3  6
13: 4  9
14: 4  8
15: 4  3
16: 4  1

to return a data.table with the top 4 values.

R Selecting highest count cells conditional on a second column

5 Answers5