dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

Question

I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.

My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.

Here's an example:

df <- data.frame(
A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
y=rnorm(9)
)

library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))

As expected, all minima are returned:

Source: local data frame [6 x 3]
Groups: A

  A x           y
1 A 1 -1.04584335
2 A 1  0.97949399
3 B 2  0.79600971
4 C 5 -0.08655151
5 C 5  0.16649962
6 C 5 -0.05948012

With ddply, I would have approach the task that way:

library(plyr)
ddply(df, .(A), function(z) {
    z[z$x == min(z$x), ][1, ]
})

... which works:

  A x           y
1 A 1 -1.04584335
2 B 2  0.79600971
3 C 5 -0.08655151

Q: Is there a way to approach this in dplyr? (For speed reasons)

@RicardoSaporta, it requires `ties=first` argument. FelixS, if you're going for speed, `rank` is a bad idea - it's computationally more demanding than `min` (or) `which.min`. — Arun, Jan 23 '14 at 17:04
Actually you probably want `min_rank()`. @Arun: dplyr provides an internal implementation of `min_rank()` that should be plenty fast enough. — hadley, Jan 24 '14 at 00:53
@hadley, 1) I don't think `min_rank` helps here. He needs the first min value (look at `plyr` solution). 2) In whatever programming language you write, the algorithmic complexity of `rank` (ties=min, max, first etc..) will be bigger than just computing `min`. — Arun, Jan 24 '14 at 01:08
@Arun: True, only `rank(x, ties.method="first")==1` works, as min and min_rank do not differentiate between multiple minima. — Felix S, Jan 24 '14 at 08:13
@FelixS, right. But that's quite expensive (because it has to sort all values for every group). You should be using `which.min`. But I don't know of a `dplyr` solution. A `plyr` case would be: `ddply(df, .(A), function(z) z[which.min(z$x), ])` — Arun, Jan 24 '14 at 11:44
@Arun it's assuming that an O(n lg n) algorithm will perform significantly worse that an O(n) algorithm without further information about the domain — hadley, Jan 25 '14 at 19:36
Sorry, I was actually thinking of `row_number()` - it's the equivalent of `rank(ties = "first")` but is be implemented more efficiently in C++. — hadley, Jan 25 '14 at 19:38
@hadley, I still don't see how that makes you consider `which.min` to be premature optimisation. AFAIK it's a natural choice, reads well, easy to understand, fast as it happens to be O(n) too. — Arun, Jan 28 '14 at 00:06
With the time I saved finding this answer, I benchmarked the `which.min()` and `rank()` solutions. For a 40k x 24 data table, these take about 130 vs 240 ms on my machine. — blongworth, Nov 21 '19 at 14:27

talat · Answer 1 · 2017-02-07T08:34:04.547

121

Update

With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:

df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
#  A x          y
#1 A 1  0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004

Original answer

For the sample data, it is also possible to use two filter after each other:

group_by(df, A) %>% 
  filter(x == min(x)) %>% 
  filter(1:n() == 1)

edited Feb 07 '17 at 08:34

answered May 20 '14 at 11:42

talat

68,970
21
126
157

3

I find `do(head)` easier to read, `df %>% group_by(A) %>% filter(x == min(x)) %>% do(head(.,1))` – baptiste May 20 '14 at 12:03
@baptiste that looks nice indeed (however, when i run it, i get an error message `Error: expecting a single value`) - do you know why? – talat May 20 '14 at 12:07
not sure, maybe we're using a different version; I have `dplyr_0.2, magrittr_1.0.0` – baptiste May 20 '14 at 13:02
Ok, so the problem is I'm still running dplyr 0.1.3. Thx – talat May 20 '14 at 13:05
1

I’d prefer being able to use `top_n` here but due to ties this method is probably the clear winner — definitely in terms of performance (when compared to `arrange %>% slice`). – Konrad Rudolph Nov 13 '15 at 14:53
In dplyr version 1.0.2, there is ```slice_max()``` which lets you select rows with highest or lowest values of a variable. Hence, this answer can be abbreviated as: ```df %>% group_by(A) %>% slice_max(x)``` – Kay Sep 08 '20 at 12:41

score 39 · Accepted Answer · answered Feb 07 '14 at 07:53

39

Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:

library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)

answered Feb 07 '14 at 07:53

Felix S

1,769
2
13
17

score 16 · Answer 3 · answered Feb 06 '14 at 21:54

For what it's worth, here's a data.table solution, to those who may be interested:

# approach with setting keys
dt <- as.data.table(df)
setkey(dt, A,x)
dt[J(unique(A)), mult="first"]

# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by=A]$V1]

junkka · Answer 4 · 2014-11-21T16:55:26.423

6

This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:

df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)

For more information see the dplyr vignette on window functions.

edited Nov 21 '14 at 16:55

answered Nov 21 '14 at 09:33

junkka

543
7
11

score 6 · Answer 5 · answered Jan 29 '21 at 18:36

6

dplyr offers slice_min function, wich do the job with the argument with_ties = FALSE

library(dplyr)

df %>% 
  group_by(A) %>% 
  slice_min(x, with_ties = FALSE)

Output :

# A tibble: 3 x 3
# Groups:   A [3]
A         x      y
<fct> <dbl>  <dbl>
1 A         1  0.273
2 B         2 -0.462
3 C         5  1.08

answered Jan 29 '21 at 18:36

Benoit Lamarsaude

375
3
11

This is my favourite answer! – Przemyslaw Remin Apr 09 '21 at 14:07

score 2 · Answer 6 · answered Feb 25 '22 at 23:27

In case you are looking to filter the minima of x and then the minima of y. An intuitive way of do it is just using filtering functions:

> df
  A x            y
1 A 1  1.856368296
2 A 1 -0.298284187
3 A 2  0.800047796
4 B 2  0.107289719
5 B 3  0.641819999
6 B 4  0.650542284
7 C 5  0.422465687
8 C 5  0.009819306
9 C 5 -0.482082635  

df %>% group_by(A) %>% 
       filter(x == min(x), y == min(y))
 
# A tibble: 3 x 3
# Groups:   A [3]
  A         x      y
  <chr> <dbl>  <dbl>
1 A         1 -0.298
2 B         2  0.107
3 C         5 -0.482

This code will filter the minima of x and y.

Also you can do a double filter that looks even more readable:

df %>% group_by(A) %>% 
  filter(x == min(x)) %>%
  filter(y == min(y))

# A tibble: 3 x 3
# Groups:   A [3]
  A         x      y
  <chr> <dbl>  <dbl>
1 A         1 -0.298
2 B         2  0.107
3 C         5 -0.482

score 1 · Answer 7 · answered Oct 27 '16 at 19:04

Another way to do it:

set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))

Result:

  a          b
1 1 -0.8356286
2 2 -2.2146999

Could also be easily adapted for getting the row in each group with maximum value.

score 0 · Answer 8 · edited Nov 30 '14 at 16:01

0

I like sqldf for its simplicity..

sqldf("select A,min(X),y from 'df.g' group by A")

Output:

A min(X)          y

1 A      1 -1.4836989

2 B      2  0.3755771

3 C      5  0.9284441

edited Nov 30 '14 at 16:01

Andy

49,085
60
166
233

answered Nov 30 '14 at 16:00

nsr

9
1

score 0 · Answer 9 · answered May 17 '22 at 10:29

0

For the sake of completeness, here's the base R answer:

df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]

#  A x          y
#1 A 1  0.1076158
#4 B 2 -1.3909084
#7 C 5  0.3511618

answered May 17 '22 at 10:29

Maël

45,206
3
29
67

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

9 Answers9

Update

Original answer

Linked

Related