I want to split two-character strings into single characters. I have a large data frame to work with, but the following small example shows what needs to be done.

  mydf <- data.frame (name = c("L1", "L2", "L3"), 
    M1 = c("AC", "AT", NA), M2 = c("CC", "--", "TC"), M3 = c("AT", "TT", "AG"))

I want to split the characters for variables M1 to M3 (in the real dataset I have > 6000 variables), so that the result looks like this:

  name  M1a  M1b  M2a  M2b  M3a  M3b
   L1    A    C    C    C    A    T
   L2    A    T    -    -    T    T
   L3   NA   NA    T    C    A    G

I tried the following code:

func<- function(x) {sapply( strsplit(x, ""),
                     match, table= c("A","C","T","G", "--", NA))}

odataframe <- data.frame(apply(mydf, 1, func) )
colnames(odataframe) <-  paste(rep(names(mydf), each = 2), c("a", "b"), sep = "")
odataframe
Gavin Simpson
jon

2 Answers

Here you go:

splitCol <- function(x){
  x <- as.character(x)
  x[is.na(x)] <- "$$"
  z <- matrix(unlist(strsplit(x, split="")), ncol=2, byrow=TRUE)
  z[z=="$"] <- NA
  z
}


newdf <- as.data.frame(do.call(cbind, lapply(mydf[, -1], splitCol)))
names(newdf) <- paste(rep(names(mydf[, -1]), each=2), c("a", "b"), sep="")
newdf <- data.frame(mydf[, 1, drop=FALSE], newdf)

newdf
  name  M1a  M1b M2a M2b M3a M3b
1   L1    A    C   C   C   A   T
2   L2    A    T   -   -   T   T
3   L3 <NA> <NA>   T   C   A   G
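
If relying on a placeholder such as "$$" is a concern (it would misfire if "$" ever appeared in the data), a variant based on substr() might avoid it, since substr() returns NA for a missing string. This is only a sketch, assuming every non-missing entry has exactly two characters; splitPair and newdf2 are hypothetical names introduced here for illustration.

```r
# Example data from the question (stringsAsFactors = FALSE keeps the columns as character)
mydf <- data.frame(
  name = c("L1", "L2", "L3"),
  M1 = c("AC", "AT", NA),
  M2 = c("CC", "--", "TC"),
  M3 = c("AT", "TT", "AG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one two-character column into a two-column matrix.
# substr() propagates NA, so no placeholder value is needed.
splitPair <- function(x) {
  x <- as.character(x)
  cbind(substr(x, 1, 1), substr(x, 2, 2))
}

markers <- mydf[, -1, drop = FALSE]                      # every column except name
pieces <- do.call(cbind, lapply(markers, splitPair))     # one wide character matrix
colnames(pieces) <- paste0(rep(names(markers), each = 2), c("a", "b"))
newdf2 <- data.frame(mydf[, 1, drop = FALSE], pieces, stringsAsFactors = FALSE)
newdf2
```

Because substr() is vectorised over the whole column, this does one pair of calls per variable, which may matter with several thousand columns.
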
Andrie
  • Thank you for the prompt reply. It works nicely, but there still seems to be a problem with the handling of NA: M1a and M1b in the third row should be NA and NA (not NA and A). – jon Nov 01 '11 at 21:22
  • I fixed this just seconds before your comment. Please try again. – Andrie Nov 01 '11 at 21:23
  • This could be very applicable to me. Thanks for posting. I chunked it together into a function and figured I'd share. Thanks for the post. – Tyler Rinker Nov 01 '11 at 21:46

Andrie's code as a reusable function:

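splitCol <- function(dataframe, splitVars=names(dataframe)){
  # drop=FALSE keeps both pieces as data frames even when only one column is selected
  split.DF <- dataframe[, splitVars, drop=FALSE]
  keep.DF <- dataframe[, !names(dataframe) %in% splitVars, drop=FALSE]

  # Split a two-character column into a two-column matrix. As in Andrie's answer,
  # NA is replaced by the placeholder "$$" before splitting and restored afterwards.
  X <- function(x){
    x <- as.character(x)
    x[is.na(x)] <- "$$"
    z <- matrix(unlist(strsplit(x, split="")), ncol=2, byrow=TRUE)
    z[z=="$"] <- NA
    z
  }

  newdf <- as.data.frame(do.call(cbind, lapply(split.DF, X)))
  names(newdf) <- paste(rep(names(split.DF), each=2), c(".a", ".b"), sep="")
  data.frame(keep.DF, newdf)
}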

Test it out

splitCol(mydf)
splitCol(mydf, c('M1','M2'))
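
When testing on columns that contain NA, it may help to see why missing values need special handling before the split. The sketch below uses a made-up vector (not the full data frame) to show that strsplit() returns a single NA for a missing string rather than two pieces, so filling a two-column matrix straight from the unlisted result would shift the later rows. Padding is shown as one possible fix; the names pieces, padded and m are only illustrative.

```r
# A missing string yields one piece, not two
pieces <- strsplit(c("AC", NA, "TC"), split = "")
lengths(pieces)

# Pad any result that is not of length two with a pair of NAs,
# so that every input row contributes exactly two cells
padded <- lapply(pieces, function(p) {
  if (length(p) == 2) p else c(NA_character_, NA_character_)
})
m <- matrix(unlist(padded), ncol = 2, byrow = TRUE)
m
```
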

Please don't vote this as the correct answer. Andrie's answer is clearly the first correct answer. This is just an extension of his code to more situations. Thanks for the question and thanks for the code Andrie.

Tyler Rinker