split characters into two variables in data frame

Question

Let's say I have a vector of variables like this:

>variable
[1] "A1" "A1" "A1" "A1" "A2" "A2" "A2" "A2" "B1" "B1" "B1" "B1"

and I want to convert this into a data frame like this:

  treatment time
1         A    1
2         A    1
3         A    1
4         A    1
5         A    2
6         A    2
7         A    2
8         A    2
9         B    1
10        B    1
11        B    1
12        B    1

To that end, I used reshape2's colsplit function. It requires a pattern to split the string, but I quickly realized there is no obvious pattern for splitting two characters that have no separator between them. I tried "" and got the following results:

> colsplit(trialm$variable,"",names=c("treatment","time"))
   treatment time
1         NA   A1
2         NA   A1
3         NA   A1
4         NA   A1
5         NA   A2
6         NA   A2
7         NA   A2
8         NA   A2
9         NA   B1
10        NA   B1
11        NA   B1
12        NA   B1

I also tried a lookbehind or lookahead regular expression:

>colsplit(trialm$variable,"(?<=\\w)",names=c("treatment","time"))
Error in gregexpr("(?<=\\w)", c("A1", "A1", "A1", "A1", "A2", "A2", "A2",  : 
  invalid regular expression '(?<=\w)', reason 'Invalid regexp'

but it gave me the above error. How can I solve this problem?

take a look at `strsplit`. Your code will be something like: `trialm$treatment <- sapply(strsplit(trialm$variable, ''), '[', 1)` — Justin, Apr 24 '13 at 15:18
I know this is OLD, but the `str_split_fixed` that is used by the `colsplit` function is now written differently, and so the code works as you would have expected it to. — A5C1D2H2I1M1N2O1R2T1, Dec 24 '17 at 14:41

A5C1D2H2I1M1N2O1R2T1 · Answer 1 · 2017-12-24T14:40:53.293

Update: 24 December 2017

Somewhere along the line, the "stringr" package (which is imported with "reshape2" and which is responsible for the splitting that takes place with colsplit) started to use "stringi" for several of its functions. Some behavior seems to have changed because of that.

Using the current "reshape2" (and current "stringr" package), colsplit works the way you would have expected it to with your code:

packageVersion("reshape2")
## [1] ‘1.4.3’
packageVersion("stringr")
## [1] ‘1.2.0’

colsplit(variable, "", names = c("treatment", "time"))
##    treatment time
## 1          A    1
## 2          A    1
## 3          A    1
## 4          A    1
## 5          A    2
## 6          A    2
## 7          A    2
## 8          A    2
## 9          B    1
## 10         B    1
## 11         B    1
## 12         B    1

Original Answer: 24 April 2013

If a pattern can be detected in your "variable" but there is no clean split character that can be used, then add one :)

library(reshape2)
variable <- c("A1", "A1", "A1", "A1", "A2", "A2", 
              "A2", "A2", "B1", "B1", "B1", "B1")
## Here, we add a "." between upper case letters and numbers
colsplit(gsub("([A-Z])([0-9])", "\\1\\.\\2", variable), 
         "\\.", c("Treatment", "Time"))
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1

Additional Options: 23 December 2017

My "splitstackshape" package has a single-purpose non-exported helper function called NoSep that can be used for this:

splitstackshape:::NoSep(variable)
##    .var .time_1
## 1     A       1
## 2     A       1
## 3     A       1
## 4     A       1
## 5     A       2
## ::: snip :::: #
## 11    B       1
## 12    B       1

The "tidyverse" (specifically the "tidyr" package) has a couple of convenient functions for splitting values into different columns: separate and extract. separate has already been demonstrated by jazzurro, but that solution is very specific to this particular problem, and separate generally works better with a delimiter. extract expects you to specify a regular expression with the groups you want to capture:

library(tidyverse)
data.frame(variable) %>% 
  extract(variable, into = c("Treatment", "Time"), regex = "([A-Z]+)([0-9]+)")
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# ::::: snip :::: #
# 11         B    1
# 12         B    1

score 9 · Accepted Answer · answered Apr 24 '13 at 15:24

substr is another way to do it.

> variable <- c(rep("A1", 4), rep("A2", 4), rep("B1", 4))
> data.frame(treatment=substr(variable, 1,1), time=as.numeric(substr(variable,2,2)))
  treatment time
1         A    1
2         A    1
3         A    1
4         A    1
5         A    2
6         A    2
7         A    2
8         A    2
9         B    1
10        B    1
11        B    1
12        B    1

answered Apr 24 '13 at 15:24

Jilber Urbina


ha! +1 if you think best I'll remove it though. – user1317221_G Apr 24 '13 at 15:27
But what if some of the variables were, say, "AA1", and "A12"? This method won't succeed with those. – A5C1D2H2I1M1N2O1R2T1 Apr 24 '13 at 16:28
Ananda, you could use regular expressions in your example to separate letters from numbers in tow columns, then see how many discrete categories you have using those two columns. – Arhopala Apr 24 '13 at 16:35
two columns, not 'tow' columns – Arhopala Apr 24 '13 at 16:35
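
The concern raised in these comments can be sketched in base R. The inputs "AA1" and "A12" are hypothetical (they are not in the original question), and the pattern assumes every value is one or more letters followed by one or more digits:

```r
# Hypothetical inputs with multi-character parts (not from the original question)
codes <- c("A1", "B2", "AA1", "A12")

# Fixed positions misread the longer codes:
# "AA1" gets time "A", and "A12" gets time "1"
by_position <- data.frame(
  treatment = substr(codes, 1, 1),
  time = substr(codes, 2, 2),
  stringsAsFactors = FALSE
)

# Capture groups instead: leading letters, then trailing digits
pattern <- "^([A-Za-z]+)([0-9]+)$"
by_pattern <- data.frame(
  treatment = sub(pattern, "\\1", codes),
  time = as.numeric(sub(pattern, "\\2", codes)),
  stringsAsFactors = FALSE
)
```

Values that do not match the assumed pattern are returned unchanged by sub, so as.numeric would give NA (with a warning) for their time.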

score 7 · Answer 3 · answered Feb 04 '15 at 05:55

If you put the vector, variable, into a data frame, you can now use separate() from the tidyr package.

library(tidyr)

mydf <- data.frame(variable = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)),
                   stringsAsFactors = FALSE)

## sep = 1 splits each string after its first character
separate(mydf, variable, c("treatment", "time"), sep = 1)

#   treatment time
#1          A    1
#2          A    1
#3          A    1
#4          A    1
#5          A    2
#6          A    2
#7          A    2
#8          A    2
#9          B    1
#10         B    1
#11         B    1
#12         B    1

score 5 · Answer 4 · answered Apr 24 '13 at 15:24

You can use substr to split it:

e.g.

df <- data.frame(treatment = substr(variable, start = 1, stop = 1),
                 time      = substr(variable, start = 2, stop = 2))

answered Apr 24 '13 at 15:24

user1317221_G


score 4 · Answer 5 · answered Apr 24 '13 at 15:28

Another solution using regular expression

require(stringr)
variable <- c(paste0("A", c(rep(1, 4), rep(2, 3))),
              paste0("B", rep(1, 4))
              )

data.frame(
    treatment = str_extract(variable, "[[:alpha:]]"),
    time = as.numeric(str_extract(variable, "[[:digit:]]"))
    )

##    treatment time
## 1          A    1
## 2          A    1
## 3          A    1
## 4          A    1
## 5          A    2
## 6          A    2
## 7          A    2
## 8          B    1
## 9          B    1
## 10         B    1
## 11         B    1

+1. I think this is *safer* than `substr` and so on if a pattern is discernible but no splitting character is available. — A5C1D2H2I1M1N2O1R2T1, Apr 24 '13 at 16:27

score 4 · Answer 6 · edited Feb 07 '15 at 21:38

A new function tstrsplit() was introduced in data.table v1.9.5. The t stands for transpose. It's the result of splitting a character vector with strsplit() and then transposing it.

# dummy data
library(data.table)
dt <- data.table(var = c(rep("A1", 4), rep("A2", 4), rep("B1", 4)))

Using tstrsplit():

dt[, tstrsplit(var, "")]

    V1 V2
 1:  A  1
 2:  A  1
 3:  A  1
 4:  A  1
 5:  A  2
 6:  A  2
 7:  A  2
 8:  A  2
 9:  B  1
10:  B  1
11:  B  1
12:  B  1

Yes, it's that easy. :-)

edited Feb 07 '15 at 21:38

Arun


answered Feb 04 '15 at 08:34

Rich Scriven


That is new! This is the function I will write down in my notebook today. – jazzurro Feb 04 '15 at 09:57

score 3 · Answer 7 · answered Apr 24 '13 at 15:26

You can use substring() to create vectors then join them using the data.frame function.

yyy <- c("A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2", "B1", "B1", "B1", "B1")

treatment <- substring(yyy, 1, 1)

time <- as.numeric(substring(yyy, 2, 2))

data.frame(treatment, time)

answered Apr 24 '13 at 15:26

Arhopala


+1 for realizing that `time` is a `factor` and you want it to be `numeric` using `as.numeric`. – Jilber Urbina Apr 24 '13 at 15:29
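
As a side note on the factor point in this comment, here is a minimal sketch of why the conversion route matters. It assumes the column really is a factor (as data.frame produced by default for character input before R 4.0); the sample values are hypothetical:

```r
# Hypothetical factor column holding numbers as labels
time_factor <- factor(c("1", "2", "10"))

# as.numeric() on a factor returns the internal level codes,
# which here do not equal the labelled values
codes_only <- as.numeric(time_factor)

# Going through character first recovers the labelled values
values <- as.numeric(as.character(time_factor))
```

In the answer above, as.numeric is applied to the character result of substring before the data frame is built, so this pitfall does not arise there.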

score 2 · Answer 8 · edited Apr 24 '13 at 15:31

You could just use strsplit

df <- t(data.frame(strsplit(variable, "")))
rownames(df) <- NULL
colnames(df) <- c("treatment" , "time" )
df
      treatment time
 [1,] "A"       "1" 
 [2,] "A"       "1" 
 [3,] "A"       "1" 
 [4,] "A"       "1" 
 [5,] "A"       "2" 
 [6,] "A"       "2" 
 [7,] "A"       "2" 
 [8,] "A"       "2" 
 [9,] "B"       "1" 
[10,] "B"       "1" 
[11,] "B"       "1" 
[12,] "B"       "1"

Instead of using t you can use rbind and then coerce to data.frame as follows:

setNames(as.data.frame(do.call(rbind, strsplit(variable, ""))), 
         c("Treatment", "Time"))
#    Treatment Time
# 1          A    1
# 2          A    1
# 3          A    1
# 4          A    1
# 5          A    2
# 6          A    2
# 7          A    2
# 8          A    2
# 9          B    1
# 10         B    1
# 11         B    1
# 12         B    1

Perhaps sufficient for the OP's needs, but what if some of the variables were, say, "AA1", and "A12"? This method won't succeed with those. — A5C1D2H2I1M1N2O1R2T1, Apr 24 '13 at 16:29
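
One way to keep the rbind approach while handling inputs like those in this comment is to match capture groups instead of splitting on "". This is a base R sketch only; the inputs are hypothetical, and it assumes every value matches letters followed by digits (non-matching values would be silently dropped by rbind):

```r
# Hypothetical inputs, including the multi-character cases from the comment
codes <- c("A1", "A2", "B1", "AA1", "A12")

# regexec() records the full match and each capture group;
# regmatches() extracts them as one character vector per input
matches <- regmatches(codes, regexec("^([A-Za-z]+)([0-9]+)$", codes))

# Stack into a matrix: column 1 is the full match, 2 and 3 are the groups
mat <- do.call(rbind, matches)

res <- data.frame(Treatment = mat[, 2],
                  Time = as.numeric(mat[, 3]),
                  stringsAsFactors = FALSE)
```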

U. Windl · Answer 9 · 2017-04-12T12:16:09.403

Based on the comment of @Justin I suggest this (using v <- c("A1", "B2")):

> t(sapply(strsplit(v, ''), '[', c(1, 2)))
     [,1] [,2]
[1,] "A"  "1" 
[2,] "B"  "2"

The vector after `'['` selects the items from each split vector, so each string is split only once and both items are kept. Maybe this is even easier if you want to keep every item:

t(sapply(strsplit(v, ''), identity))

edited Apr 12 '17 at 12:16

answered Apr 11 '17 at 11:25

U. Windl

