I have an extremely long string in R and would like to extract all substrings that match certain criteria. The string may look something like this: "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."

I have seen similar questions whose answers suggest gsub, but those seem to cover the case where only one substring needs to be extracted, not several.

What I would like to achieve as a result is a vector like this:

c("[ID: 1234]", "[ID: 5678]", "[ID: 9999]")
d.b
Jordan Hackett

3 Answers

3
x = "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."
unlist(stringr::str_extract_all(x, "\\[ID: \\d+\\]"))
# [1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"
Gregor Thomas
2

Using base R, an option would be

regmatches(text, gregexpr(pattern, text)) 

which you can then unlist() if you want your output as an atomic vector.
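
As a minimal sketch of this approach applied to the string from the question, assuming the pattern "\\[ID: \\d+\\]" is what you want to match (the variable names x, pattern, and ids are only illustrative):

```r
# The example string from the question
x <- "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."

# A literal "[ID: ", one or more digits, then a literal "]"
pattern <- "\\[ID: \\d+\\]"

# gregexpr() locates every match; regmatches() extracts them,
# returning a list with one element per input string
matches <- regmatches(x, gregexpr(pattern, x))

# Flatten the list into a plain character vector
ids <- unlist(matches)
ids
# [1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"
```

Because the result is a list with one element per input string, the same call should also work on a character vector of several strings; in that case you may prefer to keep the list rather than unlist() it.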

Hayden Y.
  • OK, great, that seems to be what I'm looking for, thank you! I am not too savvy in terms of regular expressions. I think what I want to start with is something like this: ".*\\[ID: ". What I am stuck on now is how to allow any number of numeric characters and then anchor the last "]" to the end of the string. Would you be able to assist with that? – Jordan Hackett Aug 23 '19 at 18:38
  • @JordanHackett you can use the regex pattern in my answer with any regex method. `"\\[ID: \\d+\\]"` – Gregor Thomas Aug 23 '19 at 18:38
  • Great that's awesome! Thanks @Gregor that works perfectly! – Jordan Hackett Aug 23 '19 at 18:40
0
inds = gregexpr("\\[ID: \\d+\\]", x)
lapply(inds, function(i){
    substring(x, i, i + attr(i, "match.length") - 1)
})
#[[1]]
#[1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"
d.b