How to filter for value in one column based on values from other columns in R?

Question

Here is my dataset:

df <- data.frame(label = c(1,2,3,4,5), measurement = c(100.5, 84.7, 100.7, 77.9, 98.8), size = c(20, 19, 20, 20, 15))

Now I want to be able to obtain the label (from label column) that has the maximum size. However, as in the example above, three of the labels have the maximum value of 20. I want my tie breaker to be the measurement values. So in this case, out of the three values of 20 from the size column, measurement of 100.7 is the greatest.

So I would run the code and it would return 3 has the label I should go with. How can I do this across columns?

Have you thought about dupes in the value columns? See my [answer](https://stackoverflow.com/a/65630459/6574038) below. — jay.sf, Jan 08 '21 at 15:14

score 1 · Accepted Answer · answered Jan 08 '21 at 13:48

1

You can order the dataframe in descending order based on size and measurement column and then extract the first label from it.

df1 <- df[with(df, order(-size, -measurement)), ]
df1$label[1]
#[1] 3

Or sort them in ascending order and extract the last value. Using dplyr :

library(dplyr)

df %>%
  arrange(size, measurement) %>%
  pull(label) %>% last
#[1] 3

answered Jan 08 '21 at 13:48

Ronak Shah

377,200
20
156
213

1

That doesn't take duplicates into account, though, resp. it omits them. – jay.sf Jan 08 '21 at 14:44

score 1 · Answer 2 · answered Jan 08 '21 at 13:53

The filter() function from dplyr will operate sequentially when specified as comma separated arguments. You can do something like this to filter to the observation and then pull() the label.

This method will give you multiple responses if the tiebreaker is also tied.

library(dplyr)

df %>% 
  filter(size == max(size), measurement == max(measurement)) %>% 
  pull(label)

score 1 · Answer 3 · answered Jan 08 '21 at 14:02

1

In one line in base R:

df[order(df$size, df$measurement, decreasing = TRUE)[1], "label"]
# [1] 3

answered Jan 08 '21 at 14:02

s_baldur

29,441
4
36
69

2

Nice, but think `df <- rbind(df, c(label=6, measurement=100.7, size=20))`. Accepted answer also doesn't consider this. `dplyr` even yields `6`. – jay.sf Jan 08 '21 at 14:26
@jay.sf sorry I don't see the issue here. The two observations will be tied? – s_baldur Jan 11 '21 at 10:19
Exactly not, i.e. ties won't be returned, just one observation will be displayed, the others omitted. Try with the data from my answer. – jay.sf Jan 11 '21 at 10:47
Good to be aware of it but I think we cannot tell which solution the OP would prefer. He doesn't mention that corner case. – s_baldur Jan 11 '21 at 11:24

jay.sf · Answer 4 · 2021-01-08T15:12:39.317

It's probably wise to work with rank here, to account for duplicates (in contrast to order, which selects the first maximum and dplyr::last the last, thus omitting the other maxima).

Think of this data frame with duplicates in the value columns:

df
#   label measurement size
# 1     1       100.5   20
# 2     2        84.7   19
# 3     3       100.7   20  ## !
# 4     4        77.9   20
# 5     5        98.8   15
# 6     6       100.7   20  ## !

We would add the rowSums of the ranks of the value columns and add the rank of the "tie breaker" size and finally subset the levels on the maxima.

rankv <- Vectorize(rank)
R <- rankv(df[-1])  ## or `apply(df[-1], 2, rank)`
r <- rowSums(R) + rank(R[,2])
df$label[r == max(r)]
# [1] 3 6

Data:

df <- structure(list(label = c(1, 2, 3, 4, 5, 6), measurement = c(100.5, 
84.7, 100.7, 77.9, 98.8, 100.7), size = c(20, 19, 20, 20, 15, 
20)), row.names = c(NA, -6L), class = "data.frame")

How to filter for value in one column based on values from other columns in R?

4 Answers4