5

I need help merging the rows with the same name (i.e start column) in the data (mydf) and concatenating the content in "ALT" column thereby getting rid of all the duplicate rows based on the similar values in the start column. I want to merge the rows and concatenate the contents in "ALT" column separated by a comma and get the result as shown below. Thank you for your help.

> mydf
         chr          start      end REF ALT         TYPE refGene               
       chr10 chr10:176131 176131   C   A          snp nonsynonymous SNV 
       chr10 chr10:159149 159149   C   G snp:17659149 nonsynonymous SNV 
       chr10 chr10:159149 159149   C   T snp:17659149 nonsynonymous SNV 
       chr10 chr10:241469 241469   T   C          snp          splicing 

> result
         chr          start      end REF ALT         TYPE refGene                   
       chr10 chr10:176131 176131   C   A          snp nonsynonymous SNV 
       chr10 chr10:159149 159149   C   G,T snp:17659149 nonsynonymous SNV 
       chr10 chr10:241469 241469   T   C          snp          splicing 

The DPUT is here:

structure(list(chr = c("chr3", "chr3", "chr3", "chr3"), start = c("chr3:75786036", 
"chr3:75786337", "chr3:75786337", "chr3:75788226"), end = c(75786036, 
75786337, 75786337, 75788226), REF = c("A", "G", "G", "C"), ALT = c("G", 
"A", "T", "A"), TYPE = c("snp:75786036", "snp:75786337", "snp:75786337", 
"snp:75788226"), `refGene::location` = c("nonsynonymous SNV", 
"nonsynonymous SNV", "nonsynonymous SNV", "nonsynonymous SNV"
), `refGene::type` = c("ZNF717:NM_001290208:exon5:c.T2738C:p.F913S,ZNF717:NM_001128223:exon5:c.T2738C:p.F913S,ZNF717:NM_001290209:exon5:c.T2588C:p.F863S,", 
"ZNF717:NM_001290208:exon5:c.C2437T:p.P813S,ZNF717:NM_001128223:exon5:c.C2437T:p.P813S,ZNF717:NM_001290209:exon5:c.C2287T:p.P763S,", 
"ZNF717:NM_001290208:exon5:c.C2437A:p.P813T,ZNF717:NM_001128223:exon5:c.C2437A:p.P813T,ZNF717:NM_001290209:exon5:c.C2287A:p.P763T,", 
"ZNF717:NM_001290208:exon5:c.G548T:p.C183F,ZNF717:NM_001128223:exon5:c.G548T:p.C183F,ZNF717:NM_001290209:exon5:c.G398T:p.C133F,"
)), .Names = c("chr", "start", "end", "REF", "ALT", "TYPE", "refGene::location", 
"refGene::type"), row.names = c("4041", "4051", "4052", "4128"
), class = "data.frame")
MAPK
  • 5,635
  • 4
  • 37
  • 88

3 Answers3

5

Here's a dplyr solution:

library(dplyr)
df %>% group_by(start) %>% 
       mutate(ALT = paste(ALT, collapse=",")) %>%
       distinct(start)
jeremycg
  • 24,657
  • 5
  • 63
  • 74
2

Try aggregate in base R:

newdf <- mydf[!duplicated(mydf$start),]
newdf[, 'ALT'] <- aggregate(ALT~start, data=mydf, toString)[,2]
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • @plafort-Thank you, but how do I remove the duplicated lines. For example, I just want the second row and omit the duplicate that doesn not have concatenated ALT stuff. 1 chr10 chr10:17659149 17659149 C A, G snp:17659149 nonsynonymous SNV 2 chr10 chr10:17659149 17659149 C G, T, A, G, T snp:17659149 nonsynonymous SNV – MAPK Jun 25 '15 at 03:08
  • @plafort-Sorry mate, the ALT concatenation does not seem to be correct. – MAPK Jun 25 '15 at 03:21
  • It works for the example you provided. If there is a case in your data that is not working, please add that example. – Pierre L Jun 25 '15 at 10:48
0

Something like this will work (Not tested: please dput the data)

library(data.table)
setDT(mydf)[,.(REF=paste(REF,collapse=",")),by="TYPE"]
unique(mydf, by="TYPE")
    chr         start      end REF ALT         TYPE refGene::location
1: chr3 chr3:75786036 75786036   A   G snp:75786036 nonsynonymous SNV
2: chr3 chr3:75786337 75786337 G,G   A snp:75786337 nonsynonymous SNV
3: chr3 chr3:75788226 75788226   C   A snp:75788226 nonsynonymous SNV
                                                                                                                       refGene::type
1: ZNF717:NM_001290208:exon5:c.T2738C:p.F913S,ZNF717:NM_001128223:exon5:c.T2738C:p.F913S,ZNF717:NM_001290209:exon5:c.T2588C:p.F863S,
2: ZNF717:NM_001290208:exon5:c.C2437T:p.P813S,ZNF717:NM_001128223:exon5:c.C2437T:p.P813S,ZNF717:NM_001290209:exon5:c.C2287T:p.P763S,
3:    ZNF717:NM_001290208:exon5:c.G548T:p.C183F,ZNF717:NM_001128223:exon5:c.G548T:p.C183F,ZNF717:NM_001290209:exon5:c.G398T:p.C133F,
> 
user227710
  • 3,164
  • 18
  • 35
  • @user227710 Thank you, this works, but it only returns TYPE and REF column. – MAPK Jun 25 '15 at 02:23
  • @plafort-Please refresh the page, I have the DPUT in the original post. Additionally, I would also like this to return all the column and not just TYPE and ALT columns. – MAPK Jun 25 '15 at 02:27