I have the following dummy data set:

MYdata = data.frame(
  fruit = c("apple", "apple", "apple", "apple", "apple", "apple", "apple",
            "pear", "pear", "pear", "pear", "pear", "pear",
            "lemon", "lemon", "lemon", "lemon", "lemon",
            "orange", "orange", "orange", "orange",
            "plum", "plum", "plum", "plum"),
  p = c(0.013, 0.018, 0.022, 0.035, 0.001, 0.030, 0.046,
        0.031, 0.010, 0.017, 0.035, 0.054, 0.038,
        0.038, 0.038, 0.036, 0.042, 0.043,
        0.056, 0.062, 0.055, 0.031,
        0.023, 0.003, 0.013, 0.009),
  f = c(3.4, 5.5, 4.4, 3.9, 3.7, 3.0, 1.5,
        1.3, 2.4, 1.1, 3.6, 1.4, 1.5,
        3.3, 2.0, 1.5, 1.4, 2.1,
        4.0, 2.2, 1.7, 3.2,
        4.9, 4.4, 2.1, 1.2))

(A) I would like to add column "t". The value in each cell of "t" is based on the values in "p" and "f":

If p < 0.05 AND f > 2, write the content of the corresponding cell of "fruit"; otherwise write "ns".

(this is probably easy for you guys but I can't get my head wrapped around writing functions)
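
For reference, rule (A) does not strictly need a custom function, because `ifelse()` is vectorised. The following is only a minimal sketch on a few rows taken from the dummy data; the name `d` is made up for illustration. `as.character()` is used so that it should work whether "fruit" is stored as a factor or as character (from R 4.0.0 on, `data.frame()` no longer converts strings to factors by default).

```r
# A few rows copied from the dummy data (subset chosen for illustration)
d <- data.frame(
  fruit = c("apple", "apple", "pear", "pear", "orange"),
  p     = c(0.013, 0.046, 0.010, 0.054, 0.056),
  f     = c(3.4, 1.5, 2.4, 1.4, 4.0)
)

# Keep the fruit name where both thresholds are met, otherwise "ns"
d$t <- ifelse(d$p < 0.05 & d$f > 2, as.character(d$fruit), "ns")
print(d)
```

Note that both comparisons are strict, so a row with f exactly equal to 2 ends up as "ns".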

(B) I would like to add column "top". The content of each cell in column "top" depends on how many times a fruit occurs in column "t". I'm interested in keeping the two most abundant fruits found in "t" ("ns" is NOT to be considered a fruit).

If the fruit in a cell of "t" is one of the two most abundant fruits in all of "t" then write the fruit name into the corresponding cell of "top", else write "other". If the cell of "t" contains "ns" then write "ns" into "top".
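
The rule in (B) can be broken into three small steps: count the fruits while ignoring "ns", pick the two most frequent names, then relabel. This is a rough sketch under the assumption that "t" already exists; the vector `t_col` and its values are hypothetical, and ties between equally frequent fruits are not handled specially.

```r
# Hypothetical "t" column: fruit names that passed the thresholds, or "ns"
t_col <- c("apple", "apple", "apple", "ns", "pear", "ns",
           "plum", "plum", "orange", "ns")

# Step 1: count each fruit, leaving "ns" out of the tally
counts <- table(t_col[t_col != "ns"])

# Step 2: names of the two most frequent fruits
top2 <- names(sort(counts, decreasing = TRUE))[1:2]

# Step 3: "ns" stays "ns", top fruits keep their name, the rest become "other"
top <- ifelse(t_col == "ns", "ns",
              ifelse(t_col %in% top2, t_col, "other"))
print(data.frame(t = t_col, top = top))
```

Computing the table once and storing it in `counts` avoids repeating the `sort(table(...))` call inside a nested `ifelse()`.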

Background:
Using my real data set I would like to create a volcano plot (in ggplot2) and I would like to color-code only those "fruits" that pass a certain threshold. The color-coding will therefore be based on the information in column "t".
I am running out of legend space and colors when I create the plot, since I have hundreds of "fruits". I would therefore like to color-code only the top 10 "fruits" that pass the thresholds and group the remaining "fruits" that pass the thresholds under "other".
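
For the real data, the number of fruits to keep could be made a parameter. The helper `lump_top_n()` below is a hypothetical name, not part of any package, and the toy data are made up. The plotting part is only a sketch: it assumes "f" goes on the x axis and -log10(p) on the y axis, which the question does not specify, and it is skipped if ggplot2 is not installed.

```r
# Hypothetical helper: keep the n most frequent labels, lump the rest together
lump_top_n <- function(labels, n = 10, ns = "ns", other = "other") {
  labels <- as.character(labels)
  # Frequency of each label, excluding the "not significant" marker
  counts <- sort(table(labels[labels != ns]), decreasing = TRUE)
  keep <- head(names(counts), n)
  ifelse(labels == ns, ns, ifelse(labels %in% keep, labels, other))
}

# Toy data (made up for illustration)
d <- data.frame(
  t = c("a", "a", "a", "b", "b", "c", "ns", "ns"),
  p = c(0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.2, 0.5),
  f = c(4.1, 3.5, 2.9, 3.2, 2.4, 2.2, 1.1, 0.8)
)
d$top <- lump_top_n(d$t, n = 2)

# Optional plot, only built if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plt <- ggplot2::ggplot(d, ggplot2::aes(x = f, y = -log10(p), colour = top)) +
    ggplot2::geom_point()
  # print(plt)  # uncomment to draw the plot
}
```

Mapping `colour` to "top" rather than "t" keeps the legend to at most n + 2 entries (the kept fruits, "other" and "ns").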

Solved! Part (A) was solved with baptiste's script. Part (B) was solved by combining baptiste's script and jbaums' script:

MYdata = transform(MYdata, top = ifelse(t == "ns", "ns", ifelse(t %in% names(sort(table(t), decreasing = TRUE))[names(sort(table(t), decreasing = TRUE)) != "ns"][1:2], as.character(t), "other")))

Thanks guys!

Dalmuti71
  • [unhelpful joke] after reading the question, I'm disappointed that I can't say the answer is to `melt` the dataset :( – baptiste May 15 '13 at 02:53
  • `plyr::mutate(MYdata, t = factor(ifelse(p<0.05 & f>2, levels(fruit)[fruit], "ns")))`, the second step could be along the lines of `ifelse(as.integer(t) %in% order(table(t), decreasing=TRUE)[1:2], levels(fruit)[t], "other"))` but it's not quite right... – baptiste May 15 '13 at 03:14
  • Bit messy, but I think (B) could be done with: `MYdata$top <- with(MYdata, ifelse(t=='ns', 'ns', ifelse(t %in% names(sort(table(t), dec=T))[names(sort(table(t), dec=T))!='ns'][1:2], t, 'other')))`, where the 1:2 subset indicates that you're interested in the 2 most abundant fruits. – jbaums May 15 '13 at 04:27
  • @baptiste: using plyr to address (A) is a lava-hot idea and it worked with the dummy data set. – Dalmuti71 May 15 '13 at 16:11
  • @jbaums: applying your suggestion to the dummy data set results in the following error: "Error in t == "ns" : comparison (1) is possible only for atomic and list types" – Dalmuti71 May 15 '13 at 16:14
  • you really don't need plyr though, plain `transform()` will do the same as `mutate`, which is only useful if you want to combine multiple operations into one line – baptiste May 15 '13 at 21:07
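
A note on the error quoted in the comments: R produces this kind of message when `t` is looked up and no column named "t" exists in the data frame, because `with()` then falls back to the base function `t()` (matrix transpose), and a function cannot be compared with a string. The sketch below reproduces that situation on made-up data; it assumes this was the cause in the original session, which the thread does not confirm.

```r
d <- data.frame(fruit = c("apple", "pear"))  # note: no column "t" yet

# Without a "t" column, `t` resolves to base::t, which is a function
msg <- tryCatch(
  with(d, t == "ns"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Once the column exists, the same comparison works as intended
d$t <- c("apple", "ns")
print(with(d, t == "ns"))
```

Running step (A) first, so that "t" exists before step (B) is applied, avoids the problem.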
