2

I am trying to summarize values for multiple columns in a dataframe and then change the names of whichever entries fall outside of the top 2 most abundant entries. I have the following example dataframe:

df <- data.frame(
  row.names = c("10", "20", "30", "40", "50"),
  Class = c("Class_1", "Class_1", "Class_2", "Class_2", "Class_3"),
  Order = c("Order_1", "Order_2", "Order_2", "Order_3", "Order_3"),
  freq = c(3, 7, 2, 1, 4)
)

Where rownames are unique identifiers, the first two columns store names and the freq column displays a numeric value for each unique entry. I want to group names per column and summarise the freq values for each group, for example:

library(dplyr)
df %>% group_by(Order) %>% summarise(sum = sum(freq)) 
  Order     sum
  <chr>   <dbl>
1 Order_1     3
2 Order_2     9
3 Order_3     5

df %>% group_by(Class) %>% summarise(sum = sum(freq))
  Class     sum
  <chr>   <dbl>
1 Class_1    10
2 Class_2     3
3 Class_3     4

Afterwards, I want to detect the top 2 entries with the highest sum values per column, and change every name that does not fall within the top 2 entries to "other". So the preferred output would look like this:

     Class   Order  freq
10 Class_1   Other     3
20 Class_1 Order_2     7
30   Other Order_2     2
40   Other Order_3     1
50 Class_3 Order_3     4

So, essentially the output should be like the original dataframe, however the names that fall outside of the top 2 most abundant entries based on the freq column are changed to "Other".

Kak Schoen
  • 364
  • 2
  • 18

1 Answers1

2

You could use

library(dplyr)

order_vec <-
  df %>% 
  group_by(Order) %>% 
  summarise(sum = sum(freq)) %>% 
  slice_max(sum, n = 2) %>% 
  pull(Order)

class_vec <- 
  df %>% 
  group_by(Class) %>% 
  summarise(sum = sum(freq)) %>% 
  slice_max(sum, n = 2) %>% 
  pull(Class)

to create two vectors for your top Class and Order values. Those vectors are used to determine the columns:

df %>% 
  mutate(Class = if_else(Class %in% class_vec, Class, "other"),
         Order = if_else(Order %in% order_vec, Order, "other"))

This returns

     Class   Order freq
10 Class_1   other    3
20 Class_1 Order_2    7
30   other Order_2    2
40   other Order_3    1
50 Class_3 Order_3    4
Martin Gal
  • 16,640
  • 5
  • 21
  • 39