
I have a column with different names:

X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

The name of the second person starts after the second space. For instance, "Ashley, Tremond" is one person and "WILLIAMS, Carla" is another.

I have tried:

strsplit(X, "\\,\\s|\\,|\\s")

but it splits on every space, so I get:

strsplit(X, "\\,\\s|\\,|\\s")
[[1]]
[1] "Ashley"   "Tremond"  "WILLIAMS" "Carla"   

[[2]]
[1] "Claire" "Daron" 

[[3]]
[1] "Luw"     "Douglas" "CANSLER" "Stephan"

How can I split only at the second space, so that I get the following?

[[1]]
[1] "Ashley, Tremond"  "WILLIAMS, Carla"   

[[2]]
[1] "Claire, Daron" 

[[3]]
[1] "Luw, Douglas" "CANSLER, Stephan"

Thanks in advance for all your help

ytk
Natalia P
  • `strsplit(X, "[^,] ")` gives the desired output. It splits the string where a space is not preceded by a comma. – ytk Jan 26 '17 at 21:31
  • You'll want to unlist it to maintain the vector: `unlist(strsplit(X, split = "[A-z] [A-z]"))` – Ryan Morton Jan 26 '17 at 21:34
  • @RyanMorton , if you skip the `unlist` call, it preserves the grouping level of names in the original input, and matches the expected output – Aramis7d Jan 26 '17 at 23:59
  • The expected outcome was edited into the post after my original response, but yes. strsplit() returns a list. – Ryan Morton Jan 27 '17 at 00:01
  • @ytk and Ryan, thanks so much for your help, it works – Natalia P Jan 27 '17 at 15:44
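
Because `strsplit()` drops whatever the pattern matches, a pattern such as `[^,] ` also removes the character in front of the space (for example the final `d` of `Tremond`). A minimal sketch that matches only the space itself, assuming PCRE via `perl = TRUE` and a negative lookbehind (neither is part of the comments above), could look like this:

```r
X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# Split on a space that is NOT preceded by a comma.
# The lookbehind (?<!,) inspects the previous character without consuming it.
parts <- strsplit(X, "(?<!,) ", perl = TRUE)

parts[[1]]
# => [1] "Ashley, Tremond" "WILLIAMS, Carla"
lengths(parts)
# => [1] 2 1 2
```

This relies on every name having the form "Last, First" with a single space after the comma; other layouts would need a different pattern.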

2 Answers

1

Of course @ytk's comment works, but in case you want to avoid the regex, you can be sneaky and do

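library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides separate() and unite()

# Split on every space into four pieces, then glue the pieces back together in pairs
df2 <- df %>%
  separate(col = X, into = c("person1a", "person1b", "person2a", "person2b"), sep = " ") %>%
  unite(col = "person1", person1a, person1b, sep = " ") %>%
  unite(col = "person2", person2a, person2b, sep = " ")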

which returns:

> df2
          person1          person2
1 Ashley, Tremond  WILLIAMS, Carla
2   Claire, Daron            NA NA
3    Luw, Douglas CANSLER, Stephan

P.S. I used df <- data.frame(X = c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")) to turn the input into a data frame.

Aramis7d
  • Thanks, but I am writing the exact same code and it is not working for me, and I don't really understand it. What does %>% mean? – Natalia P Jan 27 '17 at 15:48
  • @NataliaP it's `piping` syntax; check out the `magrittr` package. – Aramis7d Jan 27 '17 at 17:24
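
To illustrate the piping syntax mentioned in the comment above, here is a small sketch. It assumes the `magrittr` package is installed; `piped` and `nested` are illustrative names only.

```r
library(magrittr)  # assumed to be installed; supplies the %>% operator

X <- c("Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# x %>% f() is evaluated as f(x): the left-hand side becomes the first argument
piped  <- X %>% toupper() %>% nchar()
nested <- nchar(toupper(X))

identical(piped, nested)
# => [1] TRUE
```

The piped form reads left to right in the order the steps run, which is why the answer chains `separate()` and `unite()` this way.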
1

You can use stringr::str_match with ^(\S+(?:\s+\S+)?)?(?:\s+(.+))? regex (see the regex demo online):

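library(stringr)
# Example input assumed from the output shown below (it is not the X from the question)
x <- "string I would like to split somewhere"
str_match(x, "(?s)^(\\S+(?:\\s+\\S+)?)?(?:\\s+(.+))?")[,-1]
# => [1] "string I"                      "would like to split somewhere"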

Also, you can use utils::strcapture:

result <- utils::strcapture("^(\\S+(?:\\s+\\S+)?)?(?:\\s+(.+))?", x, list(left=character(), right=character()))
# > result
#        left                         right
#  1 string I would like to split somewhere
result$left
# => [1] "string I"
result$right
# => [1] "would like to split somewhere"

The ^(\S+(?:\s+\S+)?)?(?:\s+(.+))? regex matches

  • ^ - start of string
  • (\S+(?:\s+\S+)?)? - an optional group 1:
    • \S+ - one or more non-whitespace chars
    • (?:\s+\S+)? - an optional occurrence of one or more whitespaces and then one or more non-whitespaces
  • (?:\s+(.+))? - an optional occurrence of
    • \s+ - one or more whitespaces
    • (.+) - Group 2: one or more chars other than line break chars, as many as possible (with the (?s) flag used in the str_match call, . also matches line breaks).
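
The examples in this answer use a different input string. Applied to the `X` from the question, the same pattern can be sketched as follows; `perl = TRUE` and the column names `first` and `second` are assumptions made for this illustration.

```r
X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# Group 1 captures the first two whitespace-separated tokens, group 2 the rest
pattern <- "^(\\S+(?:\\s+\\S+)?)?(?:\\s+(.+))?"

# perl = TRUE is an assumption here so that PCRE handles \S, \s and (?:...)
people <- utils::strcapture(pattern, X,
                            proto = list(first = character(), second = character()),
                            perl = TRUE)

people$first
# => [1] "Ashley, Tremond" "Claire, Daron"   "Luw, Douglas"
people$second[c(1, 3)]
# => [1] "WILLIAMS, Carla"  "CANSLER, Stephan"
```

The second row has no second person, so group 2 does not take part in the match there and the corresponding `second` value carries no name.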

Note that it is not safe to use stringr::str_split here as the only way to use it is with a constrained-width lookbehind:

str_split(x, "(?<=^\\S{1,1000}\\s{1,1000}\\S{1,1000})\\s+")
# => [[1]]
#    [1] "string I"                      "would like to split somewhere"

Since + and * quantifiers can't be used in a constrained-width lookbehind, all you can do is use limiting quantifiers with explicit min and max bounds, such as {1,1000}. Here, 1000 chars is a limit that you need to adjust based on your data.
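
If base R is acceptable, one possible way to split at the second whitespace run without a bounded lookbehind is PCRE's `\K`, which discards everything matched so far from the reported match. This is a sketch under that assumption rather than part of the answer above, shown on the `X` from the question:

```r
X <- c("Ashley, Tremond WILLIAMS, Carla", "Claire, Daron", "Luw, Douglas CANSLER, Stephan")

# Match two whitespace-separated tokens, reset the match start with \K,
# then match only the whitespace that follows them
m <- regexpr("^\\S+\\s+\\S+\\K\\s+", X, perl = TRUE)

# invert = TRUE returns the text around the match; strings without a match come back whole
parts <- regmatches(X, m, invert = TRUE)

parts[[1]]
# => [1] "Ashley, Tremond" "WILLIAMS, Carla"
parts[[2]]
# => [1] "Claire, Daron"
```

Because `regexpr()` reports only the first match, each string is split at most once, so no upper bound such as 1000 has to be chosen.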

Wiktor Stribiżew