This is a follow-up to this question: "Concatenate previous and latter words to a word that match a condition in R"

I am looking for a regex that splits a string at the second space after each comma. Consider the example below:

vector <- c("Paulsen", "Kehr,", "Diego",
            "Schalper", "Sepúlveda,", "Alejandro",
            "Von Housen", "Kush,", "Terry")

X <- paste(vector, collapse = " ")
X

## this is the string I am looking to split:
"Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"

The second space after each comma is the criterion for my split. So, my output should be:

"Paulsen Kehr, Diego"
"Schalper Sepúlveda, Alejandro"
"Von Housen Kush, Terry"

I came up with a pattern but it is not quite working.

[^ ]+ [^ ]+, [^ ]+( )

Using it with strsplit removes all of the matched words instead of splitting only at group 1 (i.e. [^ ]+ [^ ]+, [^ ]+(group-1)). I think I just need to exclude the full match and match only the space that follows it. -- regex demo

strsplit(X, "[^ ]+ [^ ]+, [^ ]+( )")

# [1] "" [2] "" [3] "Von Housen Kush, Terry"
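
To see why the words disappear, it may help to look at what the pattern actually matches. The sketch below is only illustrative: `pattern` and `matched` are names introduced here, and the string is written out literally so the snippet runs on its own. strsplit treats each full match as the delimiter, so everything listed in `matched` is dropped from the result, not just the captured space.

```r
# Same string as X above, written out literally so this runs standalone.
X <- "Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"
pattern <- "[^ ]+ [^ ]+, [^ ]+( )"   # the pattern from the question

# List every full match of the pattern; capture groups play no role here.
matched <- regmatches(X, gregexpr(pattern, X))[[1]]
print(matched)

# strsplit() removes each of those full matches, leaving empty pieces
# where a match started right at the current position.
print(strsplit(X, pattern)[[1]])
```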

Can anyone think of a regex for finding the second space after each comma?

M--

1 Answer

You may use

> strsplit(X, ",\\s+\\S+\\K\\s+", perl=TRUE)
[[1]]
[1] "Paulsen Kehr, Diego"           "Schalper Sepúlveda, Alejandro" "Von Housen Kush, Terry"

See the regex demo

Details

  • , - a comma
  • \s+ - 1+ whitespaces
  • \S+ - 1+ non-whitespaces
  • \K - match reset operator that discards all text matched so far from the overall match, so only the whitespace that follows is treated as the delimiter
  • \s+ - 1+ whitespaces
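
As a self-contained sketch of the steps above, the snippet below rebuilds the input from the question and applies the same pattern. The second half is a swapped-in alternative and not part of the approach above: it extracts the pieces with gregexpr/regmatches instead of splitting, assuming every piece has the form "words, word". The names `parts` and `pieces` are introduced here for illustration.

```r
# Rebuild the input string exactly as in the question.
vector <- c("Paulsen", "Kehr,", "Diego",
            "Schalper", "Sepúlveda,", "Alejandro",
            "Von Housen", "Kush,", "Terry")
X <- paste(vector, collapse = " ")

# \K drops the ", <word>" part from the reported match, so only the
# whitespace after that word is consumed as the delimiter (needs perl = TRUE).
parts <- strsplit(X, ",\\s+\\S+\\K\\s+", perl = TRUE)[[1]]
print(parts)

# Alternative (an assumption, not the method above): match each piece
# directly as "text up to a comma, whitespace, one word", then trim the
# leading space left over from the previous piece.
pieces <- trimws(regmatches(X, gregexpr("[^,]+,\\s+\\S+", X, perl = TRUE))[[1]])
print(pieces)
```

Splitting keeps the pattern focused on the delimiter alone, while extraction can be easier to reason about when the shape of each piece is known in advance.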
Wiktor Stribiżew