I am trying to summarize values for multiple columns in a dataframe and then change the names of whichever entries fall outside of the top 2 most abundant entries. I have the following example dataframe:
df <- data.frame(
row.names = c("10", "20", "30", "40", "50"),
Class = c("Class_1", "Class_1", "Class_2", "Class_2", "Class_3"),
Order = c("Order_1", "Order_2", "Order_2", "Order_3", "Order_3"),
freq = c(3, 7, 2, 1, 4)
)
Where rownames are unique identifiers, the first two columns store names and the freq column displays a numeric value for each unique entry. I want to group names per column and summarise the freq values for each group, for example:
library(dplyr)
df %>% group_by(Order) %>% summarise(sum = sum(freq))
Order sum
<chr> <dbl>
1 Order_1 3
2 Order_2 9
3 Order_3 5
df %>% group_by(Class) %>% summarise(sum = sum(freq))
Class sum
<chr> <dbl>
1 Class_1 10
2 Class_2 3
3 Class_3 4
Afterwards, I want to detect the top 2 entries with the highest sum values per column, and change every name that does not fall within the top 2 entries to "other". So the preferred output would look like this:
Class Order freq
10 Class_1 Other 3
20 Class_1 Order_2 7
30 Other Order_2 2
40 Other Order_3 1
50 Class_3 Order_3 4
So, essentially the output should be like the original dataframe, however the names that fall outside of the top 2 most abundant entries based on the freq column are changed to "Other".