Non-capturing Group in R Regex

Question

I'm trying to extract the nth word from strings and found several links that suggest a method that doesn't seem to work in R.

myString <- "HANS CHRISTIAN ANDERSON III"

str_extract(myString,'(?:\\S+ ){1}(\\S+)')
# [1] "HANS CHRISTIAN"
str_extract(myString,'(?:\\S+ ){2}(\\S+)')
# [1] "HANS CHRISTIAN ANDERSON"

As you can see, my commands are returning both the non-capturing and capturing group. What's the solution to get only the specific nth word?

score 4 · Accepted Answer · answered Apr 21 '16 at 00:12


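The regex is correct. The problem is that `str_extract` returns the entire matched string, not the value of capture group 1. `str_match_all` returns the full match in the first column and each capture group in the columns after it: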

library(stringr)

r <- "(?:\\S+ ){1}(\\S+)"
s <- "HANS CHRISTIAN ANDERSON III"

str_match_all(s, r)
#[[1]]
#           [,1]           [,2]  
#[1,] "HANS CHRISTIAN" "CHRISTIAN"
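To pull out only the nth word, the same pattern can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name `nth_word` is hypothetical, and it assumes words are separated by single spaces, as in the question. It uses `str_match`, which returns a matrix with the full match in column 1 and capture group 1 in column 2.

```r
library(stringr)

# Hypothetical helper (not from the answer): return the nth
# space-delimited word of each string via capture group 1.
nth_word <- function(x, n) {
  # Skip n - 1 "word + space" units, then capture the next word.
  pattern <- sprintf("^(?:\\S+ ){%d}(\\S+)", n - 1L)
  # Column 1 is the full match; column 2 is the first capture group.
  str_match(x, pattern)[, 2]
}

s <- "HANS CHRISTIAN ANDERSON III"
nth_word(s, 2)
# [1] "CHRISTIAN"
nth_word(s, 3)
# [1] "ANDERSON"
```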

Aminah Nuraini

Is there a way to match using regex as opposed to using `str_match(myString,"(?:\\S+ ){1}(\\S+)")[2]`? – jks612 Apr 21 '16 at 00:20
In this answer, I use `str_match_all`. I think you can't get a group value using `str_match` – Aminah Nuraini Apr 21 '16 at 00:22
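On the question of getting the word from the match itself, without indexing into a capture group, one possible approach is PCRE's `\K`, which discards everything matched before it. The sketch below uses base R rather than stringr, since `\K` needs `perl = TRUE`. It is an alternative technique, not something proposed in this thread.

```r
s <- "HANS CHRISTIAN ANDERSON III"

# \K (PCRE only, hence perl = TRUE) drops the text matched so far,
# so the reported match is just the word that follows.
m <- regexpr("^(?:\\S+ ){1}\\K\\S+", s, perl = TRUE)
regmatches(s, m)
# [1] "CHRISTIAN"
```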

score 2 · Answer 2 · answered Apr 21 '16 at 00:20

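A character class is negated when its first character is "^", so `[^ ]+` matches a run of non-space characters. The pattern below captures the first word and its trailing space in group 1, the second word in group 2, and the rest of the string in group 3, then replaces the whole string with group 2.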

# second space-delimited name
gsub('^([^ ]+[ ])([^ ]+)([ ]+.+$)', "\\2", myString)
# [1] "CHRISTIAN"

Another strategy, arguably less failure-prone:

# easy to use a numeric index to pick from a scan-read:
scan(text=myString, what="")[2]
# Read 4 items
# [1] "CHRISTIAN"
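To illustrate the "less failure-prone" point, here is a small sketch comparing the two approaches on a two-word string. The variable `short` is sample data made up for this example. The `gsub` pattern needs a third space-delimited part to match, so it silently returns the input unchanged, while `scan` still returns the second word.

```r
short <- "HANS CHRISTIAN"

# The pattern requires text after the second word, so nothing matches
# and gsub() returns the input unchanged.
gsub("^([^ ]+[ ])([^ ]+)([ ]+.+$)", "\\2", short)
# [1] "HANS CHRISTIAN"

# scan() splits on whitespace; quiet = TRUE suppresses "Read N items".
scan(text = short, what = "", quiet = TRUE)[2]
# [1] "CHRISTIAN"
```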

score 2 · Answer 3 · answered Apr 21 '16 at 00:31


I'm partial to strsplit:

strsplit(myString, ' ')[[1]][2]
# [1] "CHRISTIAN"

paste(strsplit(myString, ' ')[[1]][1:2], collapse = ' ')
# [1] "HANS CHRISTIAN"

alistaire


Indeed, regex can be overkill for this task if the strings are not complex. In vectorised form you'd have: `sapply(strsplit(myString,"\\s+"), \`[\`, 2)` or `vapply(strsplit(myString,"\\s+"), \`[\`, 2, FUN.VALUE=character(1))` if speed matters. – thelatemail Apr 21 '16 at 00:35
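As a rough sketch of the vectorised form on more than one string, the example below uses a made-up vector `people` (not from the thread). Splitting on `"\\s+"` rather than `' '` also copes with repeated spaces, and a string with fewer than two words yields `NA`.

```r
# Sample data for illustration only.
people <- c("HANS CHRISTIAN ANDERSON III", "JOHN  SMITH", "CHER")

# Split each string on runs of whitespace, then take element 2 of each piece.
words <- strsplit(people, "\\s+")
second <- vapply(words, `[`, character(1), 2)
second
# returns c("CHRISTIAN", "SMITH", NA)
```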
