1

I have a data frame like this:

ID  Exp1 Exp2 Value1
AAA 5    6    7
AAA 4    8    8
BBB 3    5    9
BBB 6    7    4
CCC 2    5    6
....

and I would like to create a new row after every repetition of an ID,and do summation of previous results, like this:

ID      Exp1 Exp2 Value1
AAA     5    6    7
AAA     4    8    8
AAA.1   9    14   15
BBB     3    5    9
BBB     6    7    4
BBB.1   9    12   13
CCC     2    5    6
...

My problem is I cannot write a code to insert a new row right after same IDs.

    > for (i in 1:nrow(Data)) {   
    > temp1 <- Data[Data$ID == Data$ID[i],]   

but do not know how to proceed... Any ideas?

Update: how the original data is..

 GeneNames  Original    ID2          Com.   Ratio   Cyt     Nuc
 YWHAB  CL84Contig6     1433B_HUMAN  -0.2   0.6    1063.3   671.3
 YWHAB  CL84Contig4     1433B_HUMAN  -0.3   0.5    59.0     30.5
 YWHAE  CL1665Contig1   1433E_HUMAN  -0.3   0.5    2784.6   1490.1
 YWHAE  CL1665Contig4   1433E_HUMAN   0.1   1.2    2.1      4.8
 YWHAH  dsrrswapns      1433F_HUMAN   0.0   0.0    0.0      0.0
 YWHAG  CL2762Contig2   1433G_HUMAN  -0.3   0.4    39.5     17.7
 YWHAG  CL2762Contig3   1433G_HUMAN   0.0   0.0    0.0      0.0

how I would like to do that...

GeneNames   Original    ID2          Com.   Ratio   Cyt     Nuc
 YWHAB  CL84Contig6     1433B_HUMAN  -0.2   0.6    1063.3   671.3
 YWHAB  CL84Contig4     1433B_HUMAN  -0.3   0.5    59.0     30.5
YWHAB.1 CL84Contig6     1433B_HUMAN  -0.2   0.6    1122.4   701.8
 YWHAE  CL1665Contig1   1433E_HUMAN  -0.3   0.5    2784.6   1490.1
 YWHAE  CL1665Contig4   1433E_HUMAN   0.1   1.2    2.1      4.8
YWHAE.1 CL1665Contig1   1433E_HUMAN  -0.3   0.5    2786.6   1494.9

I have a data.frame: 13044 obs. of 94 variables: these 94 variables are num and chr columns.. I would like to sum up values only from Cyt and Nuc from same GeneNames, and write them into new row where GeneName is named "GeneName.1". Rest of the columns are not same for each GeneName. I would prefer to leave them either empty or copy the first column of the same GeneName, as in the example..

Samir
  • 23
  • 1
  • 7

1 Answers1

4

You could do this using data.table. Convert the "data.frame" to "data.table" (setDT). Create an "NA" row (.SD[1:(.N+1)]) grouped by "ID", replace the "NA" elements for each "ID" by the sum (lapply(.SD,...))

library(data.table)
setDT(df1)[, .SD[1:(.N+1)], ID][, lapply(.SD, function(x)
        replace(x, is.na(x), sum(x, na.rm=TRUE))) , ID]
#      ID Exp1 Exp2 Value1
#1: AAA    5    6      7
#2: AAA    4    8      8
#3: AAA    9   14     15
#4: BBB    3    5      9
#5: BBB    6    7      4
#6: BBB    9   12     13
#7: CCC    2    5      6
#8: CCC    2    5      6

Or you can rbind the columns with the "sum" by "ID" group. This gets ordered by "ID"

 setDT(df1)[, rbind(.SD,lapply(.SD, sum)), ID]
 #    ID Exp1 Exp2 Value1
 #1: AAA    5    6      7
 #2: AAA    4    8      8
 #3: AAA    9   14     15
 #4: BBB    3    5      9
 #5: BBB    6    7      4
 #6: BBB    9   12     13
 #7: CCC    2    5      6
 #8: CCC    2    5      6

Update

Based on the new dataset, try

  DT1 <- setDT(df1)[, .SD[1:(.N+1)], GeneNames][, 6:7 := lapply(.SD, 
       function(x) replace(x, is.na(x), sum(x, na.rm=TRUE))), 
             GeneNames, .SDcols=6:7]
  DT1[, 2:5 := lapply(.SD, function(x) replace(x, is.na(x),
             x[1L])), GeneNames, .SDcols=2:5][]
  #   GeneNames      Original         ID2 Com. Ratio    Cyt    Nuc
  #1:     YWHAB   CL84Contig6 1433B_HUMAN -0.2   0.6 1063.3  671.3
  #2:     YWHAB   CL84Contig4 1433B_HUMAN -0.3   0.5   59.0   30.5
  #3:     YWHAB   CL84Contig6 1433B_HUMAN -0.2   0.6 1122.3  701.8
  #4:     YWHAE CL1665Contig1 1433E_HUMAN -0.3   0.5 2784.6 1490.1
  #5:     YWHAE CL1665Contig4 1433E_HUMAN  0.1   1.2    2.1    4.8
  #6:     YWHAE CL1665Contig1 1433E_HUMAN -0.3   0.5 2786.7 1494.9
  #7:     YWHAH    dsrrswapns 1433F_HUMAN  0.0   0.0    0.0    0.0
  #8:     YWHAH    dsrrswapns 1433F_HUMAN  0.0   0.0    0.0    0.0
  #9:     YWHAG CL2762Contig2 1433G_HUMAN -0.3   0.4   39.5   17.7
  #10:     YWHAG CL2762Contig3 1433G_HUMAN  0.0   0.0    0.0    0.0
  #11:     YWHAG CL2762Contig2 1433G_HUMAN -0.3   0.4   39.5   17.7

Or using the rbind approach

 DT1 <- setDT(df1)[, rbind(.SD, lapply(.SD, sum)), GeneNames, .SDcols=6:7]
 setkey(df2, GeneNames, Cyt, Nuc)[DT1]

and then change the NAs in column 2:5 to first row value as before

data

 df1 <- structure(list(ID = c("AAA", "AAA", "BBB", "BBB", "CCC"), 
 Exp1 = c(5L, 4L, 3L, 6L, 2L), Exp2 = c(6L, 8L, 5L, 7L, 5L), Value1 = 
 c(7L, 8L, 9L, 4L, 6L)), .Names = c("ID", "Exp1", "Exp2", "Value1"), 
 class = "data.frame", row.names = c(NA, -5L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This is WAY too overcomplicated considering he's trying to insert rows into data frame. This is way too overcomplicated for anything but code golf. But yeah, oneliners do work. – LauriK Feb 12 '15 at 12:58
  • Is there an equivalent of `[, .SD[1:(.N+1)], ID]` in `dplyr`? I like this approach, but I cannot see translation of this in `dplyr` yet. – jazzurro Feb 12 '15 at 13:07
  • @jazzurro I think you can use the `rbind` approach and use `arrange` – akrun Feb 12 '15 at 13:09
  • @akrun thanks alot! that is exactly what I want...the thing is the actual data contains more chr columns as well, this poses a problem...would it be possible to make sum for only Exp1 and Exp2 columns? – Samir Feb 12 '15 at 13:28
  • @akrun I came up with a few ideas using `dplyr`. I found them pretty awkward. I also think that taking care of duplicated rows may be challenging. For example, this works in this case. `group_by(mydf, ID) %>% do(rbind(.[, -1], colSums(.[,-1]))) %>% group_by(ID, Exp1, Exp2, Value1) %>% distinct`. But if we are summing up two identical rows for a group, we are deleting a part of an original data. My hope was something like this would work. `group_by(mydf, ID) %>% mutate_each(funs(unlist(list(., sum(.)))))`. Let me know if you have any ideas. – jazzurro Feb 12 '15 at 14:26
  • @Samir My doubt is about filling `chr` columns for the `sum` row. Is it all NAs? – akrun Feb 12 '15 at 14:43
  • @akrun, it does not execute the code, it gives this error "Error in sum(x, na.rm = TRUE) : invalid 'type' (character) of argument" – Samir Feb 12 '15 at 14:47
  • @Samir Have you tried with the `dput` dataset on my post – akrun Feb 12 '15 at 14:49
  • @akrun, just to prevent confusion, with the data set you posted, code works perfectly well! at my original data-set contains many chr rows, I guess that is why it does not work because it cannot sum up. – Samir Feb 12 '15 at 14:53
  • @jazzurro Thanks for your ideas. I was thinking about the `rbind` approach i.e. `rbind(mydf,mydf %>% group_by(ID) %>% summarise_each(funs(sum))) %>% arrange(ID)` – akrun Feb 12 '15 at 14:53
  • @Samir Okay, now I understand the problem. Could you tell me what the `chr` column should fill up for the `sum` row – akrun Feb 12 '15 at 14:54
  • @Samir Also tell us, whether the `chr` column is unique for a particular ID – akrun Feb 12 '15 at 14:56
  • @akrun, sorry for the incomplete question!! so chr columns are not unique, for each ID they are same, therefore, it can just contain the first IDs chr column. – Samir Feb 12 '15 at 15:01
  • @Samir I updated with a variant of my first solution. Here, the `chr` or value column with NA for the sum row. – akrun Feb 12 '15 at 15:02
  • @akrun Thanks for your idea. Your code also have two identical rows for the group, CCC. This is something I tried to express in my previous comment. In your data.table idea, we do not have to deal with this issue. That's why I like the approach. – jazzurro Feb 12 '15 at 15:15
  • @jazzurro Yes, the identical rows is when there is only one observation per ID. `unique` might be an option here. – akrun Feb 12 '15 at 15:17
  • @akrun, Thanks a lot for the help, I am completely new to scripting, even to hard for me to follow... the thing is update didnt work , and gave same error, "Error in sum(x, na.rm = TRUE) : invalid 'type' (character) of argument". I updated the question!, try to be more clear with example from real data... Again, thanks for the helps!!! – Samir Feb 12 '15 at 15:35
  • @Samir I updated the solution. You have to change the column index in the ``.SDcols`. This will fill `NA` for the other columns, which can be filled. – akrun Feb 12 '15 at 15:52
  • @akrun I came back to this question today. I think this could be the translation of your first option in `dplyr`. `group_by(mydf, ID) %>% do(if(length(.$ID) > 1) rbind(.[,-1], NA) else(.[,-1])) %>% mutate_each(funs(replace(., !complete.cases(.), sum(., na.rm = TRUE))))` – jazzurro Feb 15 '15 at 11:56
  • @akrun I just wanted to try one more time. Adding a row with NA seems to require `do`. In the past, I have seen that `do` is slower. I, therefore, think that your data.table way is great. I think it would be much faster. – jazzurro Feb 15 '15 at 12:17
  • @jazzurro Does this method work with the new dataset showed in OP's post – akrun Feb 15 '15 at 12:18
  • @akrun Let me consider this with the new data set. – jazzurro Feb 15 '15 at 12:25
  • 1
    @akrun Not sure about speed, but this is what I have now. `group_by(ana, GeneNames) %>% do(if(length(.$GeneNames) > 1) rbind(.[,-1], NA) else(.[,-1])) %>% mutate_each(funs(replace(., !complete.cases(.), .[1])), 2:5) %>% mutate_each(funs(replace(., !complete.cases(.), sum(., na.rm = TRUE))), Cyt:Nuc)` – jazzurro Feb 15 '15 at 12:36
  • @akrun This is an alternative without `do`. I used micro benchmark. It seems that this is similar to what I previously proposed on my machine. `group_by(ana, GeneNames) %>%filter(n() > 1) %>% summarise(Original = Original[1], ID2 = ID2[1], Com. = Com.[1], Ratio = Ratio[1], Cyt = sum(Cyt), Nuc = sum(Nuc), GeneNames.1 = paste(GeneNames[1], ".1", sep = "")) %>% select(-GeneNames, GeneNames = GeneNames.1) %>% bind_rows(filter(group_by(bob, GeneNames), n() > 1), filter(group_by(bob, GeneNames), n() == 1))` – jazzurro Feb 15 '15 at 13:41
  • @akrun That's all right. I wanted to share what I have tried. – jazzurro Feb 15 '15 at 15:17