Conditionally duplicate rows in R while changing values in a single column

Question

I have a dataframe, df, in which some of the values in the second column, alt, are separated by commas:

ref    alt
ACTGG  A,AATGG
GGC    G,GG
GC     G
AAT    A,AA,AAA

Is there a way in R for me to duplicate each row that has comma-separated values in column df$alt such that each value is show individually? They would be duplicated number_of_commas+1 times to allow each comma-separated value to be shown once. Below is the output I'm be looking for:

ref    alt
ACTGG  AATGG
ACTGG  A
GGC    GG
GGC    G
GC     G
AAT    AAA
AAT    AA
AAT    A

Again, this would only change the values in the second column. The first column, and any other potential columns, would simply be copied.

@josilber Thanks for referring me to that question! I'll add the answer best fitted for this post, as the top solution on the other one doesn't quite work. — soosus, May 05 '14 at 15:52

score 3 · Answer 1 · answered May 05 '14 at 17:23

The concat.split.multiple function from my "splitstackshape" package already handles this:

library(splitstackshape)
concat.split.multiple(df, "alt", ",", "long")

Because it uses reshape, it will result in some rows with NA values.

However, I would suggest checking out the updated version of the function I've been working on a. It's much faster and you can find it here. This version is like the data.table approach you have selected, but you can split multiple columns at a time.

With this function you can do the following:

## Will convert your data.frame to a data.table
cSplit(df, "alt", ",", direction="long")
#      ref   alt
# 1: ACTGG     A
# 2: ACTGG AATGG
# 3:   GGC     G
# 4:   GGC    GG
# 5:    GC     G
# 6:   AAT     A
# 7:   AAT    AA
# 8:   AAT   AAA

Splitting to a "wide" form is the present default:

cSplit(df, "alt", ",")
#      ref alt_1 alt_2 alt_3
# 1: ACTGG     A AATGG    NA
# 2:   GGC     G    GG    NA
# 3:    GC     G    NA    NA
# 4:   AAT     A    AA   AAA

Thanks very much! I appreciate the flexibility of this solution. — soosus, May 05 '14 at 17:33

score 2 · Answer 2 · answered May 05 '14 at 15:13

You can use split-apply-combine, handling each row of df separately:

do.call(rbind, lapply(split(df, seq(nrow(df))),
        function(x) data.frame(ref=x$ref, y=strsplit(x$alt, ",")[[1]])))
#       ref     y
# 1.1 ACTGG     A
# 1.2 ACTGG AATGG
# 2.1   GGC     G
# 2.2   GGC    GG
# 3      GC     G
# 4.1   AAT     A
# 4.2   AAT    AA
# 4.3   AAT   AAA

soosus · Answer 3 · 2014-05-05T16:14:09.127

1

library(data.table)
dt <- data.table(df)
dt[, list(alt = unlist(strsplit(as.character(alt), ','))),
         by = list(ref)]

# ref    alt
# ACTGG  AATGG
# ACTGG  A
# GGC    GG
# GGC    G
# GC     G
# AAT    AAA
# AAT    AA
# AAT    A

edited May 05 '14 at 16:14

answered May 05 '14 at 15:52

soosus

1,211
4
18
27

Conditionally duplicate rows in R while changing values in a single column

3 Answers3