5

I am trying to extract the first four digits after a hyphen in the following string: extract_public_2018_20190530180949469_58906_20110101-20111231Texas. I am using the following code:

stringr::str_extract(
"extract_public_2018_20190530180949469_58906_20110101-20111231Texas", 
"-[[:digit:]]{4}"
)

But I get -2011 instead of 2011. How can I only extract the four digits and not the hyphen?

Ashirwad

3 Answers

5

Use a regex lookbehind, a zero-width assertion that requires something to appear before your pattern without including it in the match:

stringr::str_extract(
  "extract_public_2018_20190530180949469_58906_20110101-20111231Texas", 
  "(?<=-)[[:digit:]]{4}"
)
# [1] "2011"
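
The same lookbehind works on a whole vector of strings. Here is a minimal sketch, assuming stringr is installed; the vector `files` and its shortened file names are made up for illustration. Elements with no hyphen followed by four digits come back as NA:

```r
library(stringr)

# Hypothetical inputs: two names with a date range, one with no hyphen at all
files <- c(
  "a_20110101-20111231Texas",
  "b_20120101-20121231Ohio",
  "no_hyphen_here"
)

# (?<=-) asserts a preceding hyphen but keeps it out of the returned match
years <- str_extract(files, "(?<=-)[[:digit:]]{4}")
years
# [1] "2011" "2012" NA
```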
r2evans
2

str_extract is behaving as expected, i.e., it returns the complete match, hyphen included.

You can use str_match instead and wrap the digits in a capture group, ( ), in the pattern:

stringr::str_match(
  "extract_public_2018_20190530180949469_58906_20110101-20111231Texas", 
  "-([[:digit:]]{4})"
)

     [,1]    [,2]  
[1,] "-2011" "2011"

The first column holds the complete match and the second holds the capture group, so add [, 2] to return just the captured digits:

stringr::str_match(
  "extract_public_2018_20190530180949469_58906_20110101-20111231Texas", 
  "-([[:digit:]]{4})"
)[, 2]

[1] "2011"
neilfws
2

In base R, we can use sub to extract the 4 digits after the hyphen.

string <- "extract_public_2018_20190530180949469_58906_20110101-20111231Texas"
sub(".*-(\\d{4}).*", "\\1", string)
#[1] "2011"
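
One caveat with this approach: sub returns the input unchanged when the pattern does not match, so a string without a hyphen comes back whole rather than as NA. Here is a minimal sketch of one way to guard against that, assuming NA is the result you want for non-matching strings; the second element of `strings` is a made-up example:

```r
strings <- c(
  "extract_public_2018_20190530180949469_58906_20110101-20111231Texas",
  "no_hyphen_here"  # hypothetical input with nothing to extract
)
pattern <- ".*-(\\d{4}).*"

# sub() leaves non-matching elements untouched, so replace them with NA explicitly
years <- ifelse(grepl(pattern, strings), sub(pattern, "\\1", strings), NA_character_)
years
# [1] "2011" NA
```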
Ronak Shah