12

Let's say I have a vector of variables like this:

>variable
[1] "A1" "A1" "A1" "A1" "A2" "A2" "A2" "A2" "B1" "B1" "B1" "B1"

and I want to convert this into a data frame like this:

  treatment time
1         A    1
2         A    1
3         A    1
4         A    1
5         A    2
6         A    2
7         A    2
8         A    2
9         B    1
10        B    1
11        B    1
12        B    1

To that end, I used reshape2's colsplit function. It requires a pattern to split the string on, but I quickly realized there is no obvious pattern for splitting two characters that have no separator between them. I tried "" and got the following result:

> colsplit(trialm$variable,"",names=c("treatment","time"))
   treatment time
1         NA   A1
2         NA   A1
3         NA   A1
4         NA   A1
5         NA   A2
6         NA   A2
7         NA   A2
8         NA   A2
9         NA   B1
10        NA   B1
11        NA   B1
12        NA   B1

I also tried a lookbehind or lookahead regular expression:

>colsplit(trialm$variable,"(?<=\\w)",names=c("treatment","time"))
Error in gregexpr("(?<=\\w)", c("A1", "A1", "A1", "A1", "A2", "A2", "A2",  : 
  invalid regular expression '(?<=\w)', reason 'Invalid regexp'

but it gave me the above error. How can I solve this problem?

Alby
  • take a look at `strsplit`. Your code will be something like: `trialm$treatment <- sapply(strsplit(trialm$variable, ''), '[', 1)` – Justin Apr 24 '13 at 15:18
  • I know this is OLD, but the `str_split_fixed` that is used by the `colsplit` function is now written differently, and so the code works as you would have expected it to. – A5C1D2H2I1M1N2O1R2T1 Dec 24 '17 at 14:41

9 Answers

11

Update: 24 December 2017

Somewhere along the line, the "stringr" package (which is imported with "reshape2" and which is responsible for the splitting that takes place with colsplit) started to use "stringi" for several of its functions. Some behavior seems to have changed because of that.

Using the current "reshape2" (and current "stringr" package), colsplit works the way you would have expected it to with your code:

packageVersion("reshape2")
## [1] ‘1.4.3’
packageVersion("stringr")
## [1] ‘1.2.0’

colsplit(variable, "", names = c("treatment", "time"))
##    treatment time
## 1          A    1
## 2          A    1
## 3          A    1
## 4          A    1
## 5          A    2
## 6          A    2
## 7          A    2
## 8          A    2
## 9          B    1
## 10         B    1
## 11         B    1
## 12         B    1

Original Answer: 24 April 2013

If a pattern can be detected in your "variable" but there is no clean split character that can be used, then add one :)

library(reshape2)
variable <- c("A1", "A1", "A1", "A1", "A2", "A2", 
              "A2", "A2", "B1", "B1", "B1", "B1")
## Here, we add a "." between upper case letters and numbers
colsplit(gsub("([A-Z])([0-9])", "\\1\\.\\2", variable), 
         "\\.", c("Treatment", "Time"))
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1

Additional Options: 23 December 2017

My "splitstackshape" package has a single-purpose non-exported helper function called NoSep that can be used for this:

splitstackshape:::NoSep(variable)
##    .var .time_1
## 1     A       1
## 2     A       1
## 3     A       1
## 4     A       1
## 5     A       2
## ::: snip :::: #
## 11    B       1
## 12    B       1

The "tidyverse" (specifically the "tidyr" package) has a couple of convenient functions for splitting values into different columns: separate and extract. separate has already been demonstrated by jazzurro, but that solution is very specific to this particular problem, and separate generally works better with a delimiter. extract instead expects a regular expression containing the groups you want to capture:

library(tidyverse)
data.frame(variable) %>% 
  extract(variable, into = c("Treatment", "Time"), regex = "([A-Z]+)([0-9]+)")
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1
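
The same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration, not part of the original answer: the "AB12" value is a made-up longer code, and the pattern assumes every element is upper-case letters followed by digits.

```r
# Base R sketch of the capture-group approach (no packages needed).
variable <- c("A1", "A1", "A2", "B1", "AB12")  # "AB12" is a hypothetical longer code

# regexec() locates the full match and each parenthesised group;
# regmatches() turns those positions into c(full_match, group1, group2).
m <- regmatches(variable, regexec("^([A-Z]+)([0-9]+)$", variable))

# Stack the per-element vectors into a matrix and drop the full-match column.
parts <- do.call(rbind, m)[, -1, drop = FALSE]

result <- data.frame(Treatment = parts[, 1],
                     Time = as.integer(parts[, 2]),
                     stringsAsFactors = FALSE)
result
```

Elements that do not match the pattern come back as empty vectors and would silently drop out of the rbind() step, so it is worth checking nrow(result) against length(variable).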
A5C1D2H2I1M1N2O1R2T1
9

substr is another way to do it.

> variable <- c(rep("A1", 4), rep("A2", 4), rep("B1", 4))
> data.frame(treatment=substr(variable, 1,1), time=as.numeric(substr(variable,2,2)))
  treatment time
1         A    1
2         A    1
3         A    1
4         A    1
5         A    2
6         A    2
7         A    2
8         A    2
9         B    1
10        B    1
11        B    1
12        B    1
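
One caveat with fixed positions: substr(variable, 2, 2) keeps only a single character. A minimal sketch of the difference, assuming a hypothetical two-digit time such as "B10" (not in the original data), and using substring(), whose end position defaults to the end of the string:

```r
codes <- c("A1", "A2", "B10")  # "B10" is a made-up code with a two-digit time

# Fixed positions: only the second character is kept, so "B10" becomes time 1.
fixed <- data.frame(treatment = substr(codes, 1, 1),
                    time = as.numeric(substr(codes, 2, 2)))

# Open-ended: substring(x, 2) takes everything from position 2 onwards.
open_ended <- data.frame(treatment = substr(codes, 1, 1),
                         time = as.numeric(substring(codes, 2)))

fixed$time
open_ended$time
```

This still assumes the treatment is exactly one character; for longer treatment codes a regular-expression approach is probably the safer choice.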
Jilber Urbina
7

If you create a data frame with the vector, variable, you could use separate() from the tidyr package now.

library(tidyr)

mydf <- data.frame(variable = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)),
                   stringsAsFactors = FALSE)

# sep = 1 splits after the first character
separate(mydf, variable, c("treatment", "time"), sep = 1)

#   treatment time
#1          A    1
#2          A    1
#3          A    1
#4          A    1
#5          A    2
#6          A    2
#7          A    2
#8          A    2
#9          B    1
#10         B    1
#11         B    1
#12         B    1
jazzurro
5

You can use substr to split it:

e.g.

df <- data.frame(treatment =   substr(variable, start = 1, stop = 1),
                 time =        substr(variable, start = 2, stop = 2) )
user1317221_G
4

Another solution, using regular expressions:

require(stringr)
variable <- c(paste0("A", c(rep(1, 4), rep(2, 3))),
              paste0("B", rep(1, 4))
              )

data.frame(
    treatment = str_extract(variable, "[[:alpha:]]"),
    time = as.numeric(str_extract(variable, "[[:digit:]]"))
    )

##    treatment time
## 1          A    1
## 2          A    1
## 3          A    1
## 4          A    1
## 5          A    2
## 6          A    2
## 7          A    2
## 8          B    1
## 9          B    1
## 10         B    1
## 11         B    1
dickoa
4

A new function tstrsplit() was introduced in data.table v1.9.5. The t stands for transpose. It's the result of splitting a character vector with strsplit() and then transposing it.

# dummy data
library(data.table)
dt <- data.table(var = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)))

Using tstrsplit():

dt[, tstrsplit(var, "")]

    V1 V2
 1:  A  1
 2:  A  1
 3:  A  1
 4:  A  1
 5:  A  2
 6:  A  2
 7:  A  2
 8:  A  2
 9:  B  1
10:  B  1
11:  B  1
12:  B  1

Yes, it's that easy. :-)
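
To make the "split, then transpose" description concrete, here is a rough base R sketch of the same idea. It is not how data.table implements tstrsplit(); the names pieces, cols and out are just illustrative.

```r
var <- c("A1", "A1", "A2", "B1")

# Step 1: split every string into its characters (a list of c("A", "1"), ...).
pieces <- strsplit(var, "")

# Step 2: "transpose" by collecting the i-th piece of every element
# into one vector per output column.
cols <- lapply(1:2, function(i) vapply(pieces, `[`, character(1), i))
names(cols) <- c("V1", "V2")

out <- as.data.frame(cols, stringsAsFactors = FALSE)
out
```

The sketch hard-codes two pieces per string; strings of other lengths would need a different column count or padding.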

Arun
Rich Scriven
3

You can use substring() to create vectors then join them using the data.frame function.

yyy<-c("A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2", "B1", "B1", "B1", "B1")

treatment<-substring(yyy, 1,1)

time<-as.numeric(substring(yyy,2,2))

data.frame(treatment,time)
Arhopala
2

You could just use strsplit:

df <- t(data.frame(strsplit(variable, "")))
rownames(df) <- NULL
colnames(df) <- c("treatment" , "time" )
df
      treatment time
 [1,] "A"       "1" 
 [2,] "A"       "1" 
 [3,] "A"       "1" 
 [4,] "A"       "1" 
 [5,] "A"       "2" 
 [6,] "A"       "2" 
 [7,] "A"       "2" 
 [8,] "A"       "2" 
 [9,] "B"       "1" 
[10,] "B"       "1" 
[11,] "B"       "1" 
[12,] "B"       "1" 

Instead of using t you can use rbind and then coerce to data.frame as follows:

setNames(as.data.frame(do.call(rbind, strsplit(variable, ""))), 
         c("Treatment", "Time"))
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# 6          A    2
# 7          A    2
# 8          A    2
# 9          B    1
# 10         B    1
# 11         B    1
# 12         B    1
Arun
Simon O'Hanlon
1

Based on the comment of @Justin I suggest this (using v <- c("A1", "B2")):

> t(sapply(strsplit(v, ''), '[', c(1, 2)))
     [,1] [,2]
[1,] "A"  "1" 
[2,] "B"  "2" 

The vector after '[' selects the items to keep from each split string, so I split only once and keep both items. If you want to keep every item, this may be even easier:

t(sapply(strsplit(v, ''), identity))
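
Note that sapply() only simplifies to a matrix when every string splits into the same number of pieces. A possible variant, sketched here as an assumption rather than as part of the original suggestion, uses vapply() to state the expected length explicitly so that unexpected input fails loudly; "C10" is a made-up three-character value.

```r
v <- c("A1", "B2")

# character(2) declares that every split must yield exactly two pieces.
m <- t(vapply(strsplit(v, ""), identity, character(2)))
m

# A three-character string violates that declaration and raises an error
# instead of quietly returning a list.
bad <- try(vapply(strsplit(c("A1", "C10"), ""), identity, character(2)),
           silent = TRUE)
inherits(bad, "try-error")
```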
U. Windl