1

I have a dataframe, df, in which some of the values in the second column, alt, are separated by commas:

ref    alt
ACTGG  A,AATGG
GGC    G,GG
GC     G
AAT    A,AA,AAA

Is there a way in R for me to duplicate each row that has comma-separated values in column df$alt such that each value is show individually? They would be duplicated number_of_commas+1 times to allow each comma-separated value to be shown once. Below is the output I'm be looking for:

ref    alt
ACTGG  AATGG
ACTGG  A
GGC    GG
GGC    G
GC     G
AAT    AAA
AAT    AA
AAT    A

Again, this would only change the values in the second column. The first column, and any other potential columns, would simply be copied.

soosus
  • 1,211
  • 4
  • 18
  • 27
  • Do you really need to duplicate rows? – Fernando May 05 '14 at 15:12
  • @josilber Thanks for referring me to that question! I'll add the answer best fitted for this post, as the top solution on the other one doesn't quite work. – soosus May 05 '14 at 15:52

3 Answers3

3

The concat.split.multiple function from my "splitstackshape" package already handles this:

library(splitstackshape)
concat.split.multiple(df, "alt", ",", "long")

Because it uses reshape, it will result in some rows with NA values.

However, I would suggest checking out the updated version of the function I've been working on a. It's much faster and you can find it here. This version is like the data.table approach you have selected, but you can split multiple columns at a time.

With this function you can do the following:

## Will convert your data.frame to a data.table
cSplit(df, "alt", ",", direction="long")
#      ref   alt
# 1: ACTGG     A
# 2: ACTGG AATGG
# 3:   GGC     G
# 4:   GGC    GG
# 5:    GC     G
# 6:   AAT     A
# 7:   AAT    AA
# 8:   AAT   AAA

Splitting to a "wide" form is the present default:

cSplit(df, "alt", ",")
#      ref alt_1 alt_2 alt_3
# 1: ACTGG     A AATGG    NA
# 2:   GGC     G    GG    NA
# 3:    GC     G    NA    NA
# 4:   AAT     A    AA   AAA
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
2

You can use split-apply-combine, handling each row of df separately:

do.call(rbind, lapply(split(df, seq(nrow(df))),
        function(x) data.frame(ref=x$ref, y=strsplit(x$alt, ",")[[1]])))
#       ref     y
# 1.1 ACTGG     A
# 1.2 ACTGG AATGG
# 2.1   GGC     G
# 2.2   GGC    GG
# 3      GC     G
# 4.1   AAT     A
# 4.2   AAT    AA
# 4.3   AAT   AAA
josliber
  • 43,891
  • 12
  • 98
  • 133
1
library(data.table)
dt <- data.table(df)
dt[, list(alt = unlist(strsplit(as.character(alt), ','))),
         by = list(ref)]

# ref    alt
# ACTGG  AATGG
# ACTGG  A
# GGC    GG
# GGC    G
# GC     G
# AAT    AAA
# AAT    AA
# AAT    A
soosus
  • 1,211
  • 4
  • 18
  • 27