I have the following dummy data set:
MYdata = data.frame(
  fruit = c("apple", "apple", "apple", "apple", "apple", "apple", "apple",
            "pear", "pear", "pear", "pear", "pear", "pear",
            "lemon", "lemon", "lemon", "lemon", "lemon",
            "orange", "orange", "orange", "orange",
            "plum", "plum", "plum", "plum"),
  p = c(0.013, 0.018, 0.022, 0.035, 0.001, 0.030, 0.046,
        0.031, 0.010, 0.017, 0.035, 0.054, 0.038,
        0.038, 0.038, 0.036, 0.042, 0.043,
        0.056, 0.062, 0.055, 0.031,
        0.023, 0.003, 0.013, 0.009),
  f = c(3.4, 5.5, 4.4, 3.9, 3.7, 3.0, 1.5,
        1.3, 2.4, 1.1, 3.6, 1.4, 1.5,
        3.3, 2.0, 1.5, 1.4, 2.1,
        4.0, 2.2, 1.7, 3.2,
        4.9, 4.4, 2.1, 1.2))
(A) I would like to add a column "t". The value in each cell of "t" depends on the values of "p" and "f" in the same row:
if p < 0.05 AND f > 2, write the content of the corresponding cell of "fruit"; otherwise write "ns".
(this is probably easy for you guys but I can't get my head wrapped around writing functions)
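For reference, rule (A) does not need a hand-written function: a vectorized ifelse() is enough. The following is only a minimal sketch on the dummy data above, not necessarily the script that was eventually used. rep() is used just to shorten the "fruit" column, and as.character() is a precaution in case "fruit" is stored as a factor.

```r
# Dummy data from the question (fruit column built with rep() for brevity)
MYdata <- data.frame(
  fruit = rep(c("apple", "pear", "lemon", "orange", "plum"),
              times = c(7, 6, 5, 4, 4)),
  p = c(0.013, 0.018, 0.022, 0.035, 0.001, 0.030, 0.046,
        0.031, 0.010, 0.017, 0.035, 0.054, 0.038,
        0.038, 0.038, 0.036, 0.042, 0.043,
        0.056, 0.062, 0.055, 0.031,
        0.023, 0.003, 0.013, 0.009),
  f = c(3.4, 5.5, 4.4, 3.9, 3.7, 3.0, 1.5,
        1.3, 2.4, 1.1, 3.6, 1.4, 1.5,
        3.3, 2.0, 1.5, 1.4, 2.1,
        4.0, 2.2, 1.7, 3.2,
        4.9, 4.4, 2.1, 1.2)
)

# Rule (A): keep the fruit name where both thresholds pass, else "ns".
# as.character() makes this work whether fruit is a factor or a character vector.
MYdata$t <- ifelse(MYdata$p < 0.05 & MYdata$f > 2,
                   as.character(MYdata$fruit), "ns")

# Quick check of how often each label occurs
table(MYdata$t)
```

Note that the comparison is strict, so a row with f exactly 2.0 is labeled "ns".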
(B) I would like to add column "top". The content of each cell in column "top" depends on how many times a fruit occurs in column "t". I'm interested in keeping the two most abundant fruits found in "t" ("ns" is NOT to be considered a fruit).
If the fruit in a cell of "t" is one of the two most abundant fruits in all of "t" then write the fruit name into the corresponding cell of "top", else write "other". If the cell of "t" contains "ns" then write "ns" into "top".
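Rule (B) can be sketched as a small helper that counts the non-"ns" labels, keeps the n most frequent ones, and relabels everything else. The name keep_top and its arguments are hypothetical, made up for this illustration; the sketch assumes column "t" was built as described in (A).

```r
# Hypothetical helper: keep the n most frequent labels in t,
# leave "ns" untouched, and lump all remaining labels into "other".
keep_top <- function(t, n = 2, ns = "ns", other = "other") {
  t <- as.character(t)                      # works for factors and characters
  counts <- sort(table(t[t != ns]), decreasing = TRUE)
  top <- head(names(counts), n)             # at most n names, no NA padding
  ifelse(t == ns, ns, ifelse(t %in% top, t, other))
}

# Dummy data from the question (fruit column built with rep() for brevity)
MYdata <- data.frame(
  fruit = rep(c("apple", "pear", "lemon", "orange", "plum"),
              times = c(7, 6, 5, 4, 4)),
  p = c(0.013, 0.018, 0.022, 0.035, 0.001, 0.030, 0.046,
        0.031, 0.010, 0.017, 0.035, 0.054, 0.038,
        0.038, 0.038, 0.036, 0.042, 0.043,
        0.056, 0.062, 0.055, 0.031,
        0.023, 0.003, 0.013, 0.009),
  f = c(3.4, 5.5, 4.4, 3.9, 3.7, 3.0, 1.5,
        1.3, 2.4, 1.1, 3.6, 1.4, 1.5,
        3.3, 2.0, 1.5, 1.4, 2.1,
        4.0, 2.2, 1.7, 3.2,
        4.9, 4.4, 2.1, 1.2)
)
MYdata$t <- ifelse(MYdata$p < 0.05 & MYdata$f > 2,
                   as.character(MYdata$fruit), "ns")

# Keep the two most abundant fruits in t
MYdata$top <- keep_top(MYdata$t, n = 2)
table(MYdata$top)
```

On the dummy data the two most frequent fruits in "t" are apple and plum, so pear, lemon and orange end up as "other". If two fruits are tied at the cut-off, which one is kept depends on the order sort() returns, so ties may need explicit handling on real data.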
Background:
Using my real data set I would like to create a volcano plot (in ggplot2) and I would like to color-code only those "fruits" that pass a certain threshold. The color-coding will therefore be based on the information in column "t".
I am running out of legend space and colors when I create the plot, since I have hundreds of "fruits". I would therefore like to color-code only the top 10 "fruits" that pass the thresholds and group the remaining "fruits" that pass the thresholds under "other".
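Once a "top" column exists, the legend can be kept short by turning it into a factor with a fixed level order and mapping it to a named color vector. The sketch below uses five rows taken from the dummy data, with "top" filled in by hand according to rules (A) and (B). Putting "f" on the x-axis and -log10(p) on the y-axis is an assumption about the volcano plot layout, and the colors are arbitrary. The ggplot2 part only runs if the package is installed.

```r
# Five rows from the dummy data, with "top" filled in by hand
d <- data.frame(
  p   = c(0.013, 0.023, 0.010, 0.046, 0.056),
  f   = c(3.4, 4.9, 2.4, 1.5, 4.0),
  top = c("apple", "plum", "other", "ns", "ns")
)

# Legend order: the kept fruits first, then "other", then "ns"
fruits <- setdiff(unique(d$top), c("other", "ns"))
d$top <- factor(d$top, levels = c(fruits, "other", "ns"))

# Named palette: one color per kept fruit, greys for the two catch-all groups
pal <- c(setNames(rainbow(length(fruits)), fruits),
         other = "grey40", ns = "grey80")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Assumed layout: f on the x-axis, -log10(p) on the y-axis
  volcano <- ggplot(d, aes(x = f, y = -log10(p), colour = top)) +
    geom_point() +
    scale_colour_manual(values = pal)
  # print(volcano) draws the plot
}
```

With the real data, only the kept fruits plus "other" and "ns" would appear in the legend, regardless of how many distinct fruits the data set contains.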
Solved! Part (A) was solved with baptiste's script. Part (B) was solved by combining baptiste's script and jbaums' script:
MYdata = transform(MYdata, top = ifelse(t == "ns", "ns",
    ifelse(t %in% head(setdiff(names(sort(table(t), decreasing = TRUE)), "ns"), 2),
        as.character(t), "other")))
Thanks guys!