R: How to separate values only after the second space

Question

I have a column with different names:

X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

The second person's name starts after the second space. For instance, "Ashley, Tremond" is one person and "WILLIAMS, Carla" is another.

I have tried:

strsplit(X, "\\,\\s|\\,|\\s")

but it splits on every space, so I get:

strsplit(X, "\\,\\s|\\,|\\s")
[[1]]
[1] "Ashley"   "Tremond"  "WILLIAMS" "Carla"   

[[2]]
[1] "Claire" "Daron" 

[[3]]
[1] "Luw"     "Douglas" "CANSLER" "Stephan"

How can I split only after the second space, so that I get:

[[1]]
[1] "Ashley, Tremond" "WILLIAMS, Carla"

[[2]]
[1] "Claire, Daron" 

[[3]]
[1] "Luw, Douglas" "CANSLER, Stephan"

Thanks in advance for all your help

`strsplit(X, "[^,] ")` gives the desired output. It splits the string where a space is not preceded by a comma. — ytk, Jan 26 '17 at 21:31
You'll want to unlist it to maintain the vector: `unlist(strsplit(X, split = "[A-z] [A-z]"))` — Ryan Morton, Jan 26 '17 at 21:34
@RyanMorton , if you skip the `unlist` call, it preserves the grouping level of names in the original input, and matches the expected output — Aramis7d, Jan 26 '17 at 23:59
The expected outcome was edited into the post after my original response, but yes. strsplit() returns a list. — Ryan Morton, Jan 27 '17 at 00:01
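
A note on the patterns in the comments above: both `[^,] ` and `[A-z] [A-z]` consume the characters next to the space, so `strsplit` drops them (the first pattern returns "Ashley, Tremon" rather than "Ashley, Tremond"). A minimal sketch of a variant that keeps those characters, assuming every name has the form "Last, First" so that the only space not preceded by a comma is the one between two people, uses a negative lookbehind with `perl = TRUE`:

```r
X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# Split on a space that is NOT preceded by a comma. The lookbehind is
# zero-width, so only the space itself is removed by the split.
people <- strsplit(X, "(?<!,) ", perl = TRUE)

people[[1]]
# [1] "Ashley, Tremond" "WILLIAMS, Carla"
```

Like any `strsplit` call, this returns a list with one element per input string, which matches the grouping in the expected output.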

Aramis7d · Answer 1 · 2017-01-27T17:24:10.660

1

As an alternative to the regex approach in @ytk's comment, you can be sneaky and use separate() and unite() from the tidyr package:

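library(tidyr)  # provides separate(), unite() and re-exports %>%
df2 <- df %>%
  # split on every space into four pieces (rows with fewer pieces are filled with NA)
  separate(col = X, into = c("person1a", "person1b", "person2a", "person2b"), sep = " ") %>%
  # glue the pieces back together pairwise
  unite(col = "person1", person1a, person1b, sep = " ") %>%
  unite(col = "person2", person2a, person2b, sep = " ")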

which returns:

> df2
          person1          person2
1 Ashley, Tremond  WILLIAMS, Carla
2   Claire, Daron            NA NA
3    Luw, Douglas CANSLER, Stephan

p.s. I use df <- data.frame(X = c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")) to make the input into a dataframe.

edited Jan 27 '17 at 17:24

answered Jan 26 '17 at 23:57

Aramis7d


Thanks, but I am writing the exact same code and it is not working for me, and I don't really understand it. What does %>% mean? – Natalia P Jan 27 '17 at 15:48
@NataliaP it's the `piping` syntax; check out the `magrittr` package. – Aramis7d Jan 27 '17 at 17:24
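
For readers unfamiliar with the pipe mentioned in the comment above: `x %>% f(y)` is evaluated as `f(x, y)`, so a chain of pipes reads left to right instead of inside out. A small sketch, assuming the `magrittr` package is installed (the toy vector `v` is only for illustration and is not part of the question):

```r
library(magrittr)

v <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(v))

# Piped call: the left-hand side becomes the first argument of the next function
piped <- v %>% sqrt() %>% sum()

identical(nested, piped)
# [1] TRUE
```

If the code in the answer fails with an error about `%>%` not being found, the most likely cause is that no package providing the pipe has been loaded.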

score 1 · Answer 2 · answered Jan 24 '22 at 16:16

You can use stringr::str_match with the ^(\S+(?:\s+\S+)?)?(?:\s+(.+))? regex:

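library(stringr)
# Example input that produces the outputs shown in this answer
x <- "string I would like to split somewhere"
str_match(x, "(?s)^(\\S+(?:\\s+\\S+)?)?(?:\\s+(.+))?")[,-1]
# => [1] "string I"                      "would like to split somewhere"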

Also, you can use utils::strcapture:

result <- utils::strcapture("^(\\S+(?:\\s+\\S+)?)?(?:\\s+(.+))?", x, list(left=character(), right=character()))
# > result
#        left                         right
#  1 string I would like to split somewhere
result$left
# => [1] "string I"
result$right
# => [1] "would like to split somewhere"

The ^(\S+(?:\s+\S+)?)?(?:\s+(.+))? regex matches

^ - start of string
(\S+(?:\s+\S+)?)? - an optional group 1:
- \S+ - one or more non-whitespace chars
- (?:\s+\S+)? - an optional occurrence of one or more whitespaces and then one or more non-whitespaces
(?:\s+(.+))? - an optional occurrence of
- \s+ - one or more whitespaces
- (.+) - Group 2: one or more chars other than line break chars, as many as possible (with the (?s) flag used in the str_match call above, . matches line break chars too).

Note that it is not safe to use stringr::str_split here as the only way to use it is with a constrained-width lookbehind:

str_split(x, "(?<=^\\S{1,1000}\\s{1,1000}\\S{1,1000})\\s+")
# => [[1]]
#    [1] "string I"                      "would like to split somewhere"

Since in a constrained-width lookbehind one can't use + or * quantifiers, all you can do is use limiting quantifiers with declared min and max arguments, {1,1000}. Here, 1000 chars is the limit that you need to adjust based on your data.
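
The examples in this answer use a sample string rather than the data from the question. As a rough sketch of the same idea (split after the second whitespace-separated token) applied to the question's vector `X`, the following uses only base R. The variable names `prefix`, `has_second`, `left` and `right` are chosen here for illustration:

```r
X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# Two whitespace-separated tokens followed by more whitespace
prefix <- "^(\\S+\\s+\\S+)\\s+"
has_second <- grepl(prefix, X, perl = TRUE)

# Left part: the first two tokens (the string is unchanged if nothing follows them)
left <- sub(paste0(prefix, ".*$"), "\\1", X, perl = TRUE)

# Right part: everything after the second whitespace run, NA if there is none
right <- ifelse(has_second, sub(prefix, "", X, perl = TRUE), NA_character_)

data.frame(left, right)
```

This does not depend on the commas at all, so it also works if some names lack them, but it does assume that each person is written as exactly two tokens.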
