Split string into multiple rows by capital letters with cSplit

Question

I have survey data. Some questions allowed for multiple answers. In my data, the different answers are separated by a comma. I want to add a new row in the dataframe for each choice. So I have something like this:

survey$q1 <- c("I like this", "I like that", "I like this, but not much",
 "I like that, but not much", "I like this,I like that", 
"I like this, but not much,I like that")

If commas were only there to divide the multiple choices I'd use:

survey <- cSplit(survey, "q1", ",", direction = "long")

and get the desired result. Given some commas are part of the answer, I tried using comma followed by capital letter as a divider:

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

But for some reason it does not work. It does not give any error, but it does not split strings and also it removes some rows from the dataframe. I then tried using strsplit:

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

which works in splitting it correctly, but I'm not able to implement it so that each sentence becomes a different row of the same column, like cSplit does. The required output is:

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is there a way I can get it using one of the 2 methods? Thank you

akrun · Accepted Answer · 2019-09-06T17:32:26.057


An option with separate_rows

library(dplyr)
library(tidyr)
survey %>% 
   separate_rows(q1, sep=",(?=[A-Z])")
#                       q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

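With cSplit, the fixed argument is TRUE by default, so the separator is matched literally rather than as a regex. Setting fixed = FALSE fails as well: as the error below shows, cSplit calls strsplit without perl = TRUE, and the default regex engine does not support the lookahead (?=...)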

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
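The same failure can be reproduced with strsplit alone. This is a minimal sketch (the vector x is made-up sample data, not the poster's survey): without perl = TRUE the lookahead pattern is rejected, with perl = TRUE it splits only on commas followed by a capital letter.

```r
# Made-up sample strings, modeled on the question's data
x <- c("I like this,I like that", "I like this, but not much,I like that")
pattern <- ",(?=[A-Z])"

# Default regex engine: the lookahead is not supported, so the call errors.
# The error is caught here and replaced by a marker string.
tre_result <- tryCatch(
  suppressWarnings(strsplit(x, pattern)),
  error = function(e) "error"
)

# PCRE engine: only a comma directly followed by a capital letter is a split point
pcre_result <- strsplit(x, pattern, perl = TRUE)
print(pcre_result)
# [[1]] "I like this" "I like that"
# [[2]] "I like this, but not much" "I like that"
```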

One way around this is to first replace the separator using a function that accepts PCRE regex (sub/gsub with perl = TRUE), and then call cSplit on the new separator. gsub is used here so that every matching comma in a string is replaced, not only the first

cSplit(transform(survey, q1 = gsub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
         "q1", sep = ":", direction = "long")
#                        q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that

data

survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))
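If other columns (for example an ID) have to be carried along, a base R route is also possible. The following is only a sketch under assumptions: the id column and the three sample rows are hypothetical, and the idea is to split with strsplit(perl = TRUE) and repeat each id once per resulting piece.

```r
# Hypothetical data: an id column plus the multi-answer column
survey_id <- data.frame(
  id = 1:3,
  q1 = c("I like this",
         "I like this,I like that",
         "I like this, but not much,I like that,I like those"),
  stringsAsFactors = FALSE
)

# Split each answer on a comma that is followed by a capital letter
parts <- strsplit(survey_id$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each id as many times as its answer was split into pieces
long <- data.frame(
  id = rep(survey_id$id, lengths(parts)),
  q1 = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

Each id appears once per piece, so rows that came from the same original answer stay linked.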

edited Sep 06 '19 at 17:32

answered Sep 06 '19 at 17:22

akrun


thank you. I tried but I got this error, I don't know why it wouldn't work with characters: Error in UseMethod("separate_rows_") : no applicable method for 'separate_rows_' applied to an object of class "character" – Antonio Sep 06 '19 at 17:26
@Antonio Can you try on the `data` I showed on my post – akrun Sep 06 '19 at 17:27
I did, I actually get the same error. Plus I have to use survey$q1 as x, rather than just q1 – Antonio Sep 06 '19 at 17:29
I tried checking the cSplit part, you are right, that's exactly what happens – Antonio Sep 06 '19 at 17:30
@Antonio `separate_rows` takes the data.frame and the column name, 'q1'. But if you are using `survey$q1`, I don't see why you would use `cSplit` – akrun Sep 06 '19 at 17:31
@Antonio I updated with `cSplit`, should work now – akrun Sep 06 '19 at 17:31
@Antonio Regarding your comment `Plus I have to use survey$q1 as x, rather than just q1`, that is not the case when you use `cSplit` – akrun Sep 06 '19 at 17:33
Thank you! it works almost perfectly, there is still a couple of answers it does not detect but I'll play around with it until it does! – Antonio Sep 06 '19 at 17:36
regarding the Plus I have to use survey$q1 as x, rather than just q1, I meant that I got an error using your initial solution with separate_rows – Antonio Sep 06 '19 at 17:37
@Antonio Okay, then I don't see why you are using `cSplit` instead of `strsplit` – akrun Sep 06 '19 at 17:38
@Antonio `separate_rows` is from `tidyr`. I use the dev version of `tidyr`, though it should work with an up-to-date CRAN version – akrun Sep 06 '19 at 17:39
because I have other columns with ID for example and I need the ID variable to have the same value for the 2 parts of the string that are now 2 different rows – Antonio Sep 06 '19 at 17:40
Oh ok I'll check the version. Thank you very much – Antonio Sep 06 '19 at 17:42
actually there was nothing to add to your code, I had strings that needed to be split into 3 parts and just running the same command twice did the trick. I don't understand why, but that's fine – Antonio Sep 06 '19 at 18:21

score 1 · Answer 2 · answered Sep 06 '19 at 18:20

The answer by @akrun is the right one. I just wanted to add that, if some strings need to be split into more than two parts and the separator is replaced with sub(), the same line has to be run multiple times. I'm not entirely sure why this is the case, but it works.
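A likely reason for the repeated runs, shown as a small sketch on a made-up string: sub() replaces only the first match in each string, while gsub() replaces every match, so a three-part answer keeps one unreplaced comma after a single sub() pass.

```r
# Made-up answer with three parts
x <- "I like this,I like that,I like those"
pattern <- ",(?=[A-Z])"

# sub() replaces only the first matching comma
once <- sub(pattern, ":", x, perl = TRUE)
print(once)
# "I like this:I like that,I like those"

# gsub() replaces every matching comma in one pass
all_matches <- gsub(pattern, ":", x, perl = TRUE)
print(all_matches)
# "I like this:I like that:I like those"
```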
