merge rows concatenating the contents in one of the columns separated by comma-R programming

Question

I need help merging the rows with the same name (i.e start column) in the data (mydf) and concatenating the content in "ALT" column thereby getting rid of all the duplicate rows based on the similar values in the start column. I want to merge the rows and concatenate the contents in "ALT" column separated by a comma and get the result as shown below. Thank you for your help.

> mydf
         chr          start      end REF ALT         TYPE refGene               
       chr10 chr10:176131 176131   C   A          snp nonsynonymous SNV 
       chr10 chr10:159149 159149   C   G snp:17659149 nonsynonymous SNV 
       chr10 chr10:159149 159149   C   T snp:17659149 nonsynonymous SNV 
       chr10 chr10:241469 241469   T   C          snp          splicing 

> result
         chr          start      end REF ALT         TYPE refGene                   
       chr10 chr10:176131 176131   C   A          snp nonsynonymous SNV 
       chr10 chr10:159149 159149   C   G,T snp:17659149 nonsynonymous SNV 
       chr10 chr10:241469 241469   T   C          snp          splicing

The DPUT is here:

structure(list(chr = c("chr3", "chr3", "chr3", "chr3"), start = c("chr3:75786036", 
"chr3:75786337", "chr3:75786337", "chr3:75788226"), end = c(75786036, 
75786337, 75786337, 75788226), REF = c("A", "G", "G", "C"), ALT = c("G", 
"A", "T", "A"), TYPE = c("snp:75786036", "snp:75786337", "snp:75786337", 
"snp:75788226"), `refGene::location` = c("nonsynonymous SNV", 
"nonsynonymous SNV", "nonsynonymous SNV", "nonsynonymous SNV"
), `refGene::type` = c("ZNF717:NM_001290208:exon5:c.T2738C:p.F913S,ZNF717:NM_001128223:exon5:c.T2738C:p.F913S,ZNF717:NM_001290209:exon5:c.T2588C:p.F863S,", 
"ZNF717:NM_001290208:exon5:c.C2437T:p.P813S,ZNF717:NM_001128223:exon5:c.C2437T:p.P813S,ZNF717:NM_001290209:exon5:c.C2287T:p.P763S,", 
"ZNF717:NM_001290208:exon5:c.C2437A:p.P813T,ZNF717:NM_001128223:exon5:c.C2437A:p.P813T,ZNF717:NM_001290209:exon5:c.C2287A:p.P763T,", 
"ZNF717:NM_001290208:exon5:c.G548T:p.C183F,ZNF717:NM_001128223:exon5:c.G548T:p.C183F,ZNF717:NM_001290209:exon5:c.G398T:p.C133F,"
)), .Names = c("chr", "start", "end", "REF", "ALT", "TYPE", "refGene::location", 
"refGene::type"), row.names = c("4041", "4051", "4052", "4128"
), class = "data.frame")

score 5 · Accepted Answer · answered Jun 25 '15 at 03:14

5

Here's a dplyr solution:

library(dplyr)
df %>% group_by(start) %>% 
       mutate(ALT = paste(ALT, collapse=",")) %>%
       distinct(start)

answered Jun 25 '15 at 03:14

jeremycg

24,657
5
63
74

score 2 · Answer 2 · answered Jun 25 '15 at 02:59

2

Try aggregate in base R:

newdf <- mydf[!duplicated(mydf$start),]
newdf[, 'ALT'] <- aggregate(ALT~start, data=mydf, toString)[,2]

answered Jun 25 '15 at 02:59

Pierre L

28,203
6
47
69

@plafort-Thank you, but how do I remove the duplicated lines. For example, I just want the second row and omit the duplicate that doesn not have concatenated ALT stuff. 1 chr10 chr10:17659149 17659149 C A, G snp:17659149 nonsynonymous SNV 2 chr10 chr10:17659149 17659149 C G, T, A, G, T snp:17659149 nonsynonymous SNV – MAPK Jun 25 '15 at 03:08
@plafort-Sorry mate, the ALT concatenation does not seem to be correct. – MAPK Jun 25 '15 at 03:21
It works for the example you provided. If there is a case in your data that is not working, please add that example. – Pierre L Jun 25 '15 at 10:48

user227710 · Answer 3 · 2015-06-25T20:57:32.990

Something like this will work (Not tested: please dput the data)

library(data.table)
setDT(mydf)[,.(REF=paste(REF,collapse=",")),by="TYPE"]
unique(mydf, by="TYPE")
    chr         start      end REF ALT         TYPE refGene::location
1: chr3 chr3:75786036 75786036   A   G snp:75786036 nonsynonymous SNV
2: chr3 chr3:75786337 75786337 G,G   A snp:75786337 nonsynonymous SNV
3: chr3 chr3:75788226 75788226   C   A snp:75788226 nonsynonymous SNV
                                                                                                                       refGene::type
1: ZNF717:NM_001290208:exon5:c.T2738C:p.F913S,ZNF717:NM_001128223:exon5:c.T2738C:p.F913S,ZNF717:NM_001290209:exon5:c.T2588C:p.F863S,
2: ZNF717:NM_001290208:exon5:c.C2437T:p.P813S,ZNF717:NM_001128223:exon5:c.C2437T:p.P813S,ZNF717:NM_001290209:exon5:c.C2287T:p.P763S,
3:    ZNF717:NM_001290208:exon5:c.G548T:p.C183F,ZNF717:NM_001128223:exon5:c.G548T:p.C183F,ZNF717:NM_001290209:exon5:c.G398T:p.C133F,
>

@user227710 Thank you, this works, but it only returns TYPE and REF column. — MAPK, Jun 25 '15 at 02:23
@plafort-Please refresh the page, I have the DPUT in the original post. Additionally, I would also like this to return all the column and not just TYPE and ALT columns. — MAPK, Jun 25 '15 at 02:27

merge rows concatenating the contents in one of the columns separated by comma-R programming

3 Answers3

Linked

Related