Extract all substrings meeting criteria using R regex

Question

I have an extremely long string in R and would like to extract all substrings that match certain criteria. The string may look something like this: "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."

I have seen similar questions where gsub is offered as a solution, but that seems to apply only when a single substring needs to be extracted, not several.

What I would like to achieve as a result is a vector like this:

c("[ID: 1234]", "[ID: 5678]", "[ID: 9999]")

See `?stringr::str_extract_all` – Gregor Thomas Aug 23 '19 at 18:29

score 3 · Accepted Answer · answered Aug 23 '19 at 18:37

x = "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."
unlist(stringr::str_extract_all(x, "\\[ID: \\d+\\]"))
# [1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"

Gregor Thomas

score 2 · Answer 2 · answered Aug 23 '19 at 18:32

Using base R, an option would be

regmatches(text, gregexpr(pattern, text))

which you can then unlist() if you want your output as an atomic vector.
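
As a minimal sketch, here is that call filled in with the string from the question and the pattern from the accepted answer. The names `text` and `pattern` are just the placeholders used above, and `ids` is a name made up for this illustration.

```r
# Input string from the question
text <- "some text some text some text [ID: 1234] some text some text [ID: 5678] some text some text [ID: 9999]."
# Pattern from the accepted answer: literal "[ID: ", one or more digits, literal "]"
pattern <- "\\[ID: \\d+\\]"

# gregexpr() locates every match; regmatches() pulls out the matched text.
# The result is a list with one element per input string.
matches <- regmatches(text, gregexpr(pattern, text))

# Flatten the list to a plain character vector
ids <- unlist(matches)
print(ids)
# [1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"
```

Since `text` here is a single string, the list has exactly one element, so `matches[[1]]` would give the same vector as `unlist(matches)`.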

Hayden Y.

Ok great, that seems to be what I'm looking for, thank you! I am not too savvy in terms of regular expressions. I think what I want to start out with is something like this: ".*\\[ID: ". What I am stuck on now is how to allow any number of numeric characters and then anchor the last "]" to the end of the string. Would you be able to assist with that? – Jordan Hackett Aug 23 '19 at 18:38

@JordanHackett you can use the regex pattern in my answer with any regex method. `"\\[ID: \\d+\\]"` – Gregor Thomas Aug 23 '19 at 18:38
Great that's awesome! Thanks @Gregor that works perfectly! – Jordan Hackett Aug 23 '19 at 18:40
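
To illustrate how the pieces of that pattern behave, here is a small sketch using base R. The test string `y` is made up for this example and is not from the question. `\\[` and `\\]` match literal brackets, and `\\d+` matches one or more digits, so IDs of any length are picked up while empty or non-numeric IDs are skipped. No leading `.*` is needed, because the extraction functions search for matches anywhere in the string.

```r
# Hypothetical test string with IDs of different lengths and two malformed IDs
y <- "a [ID: 7] b [ID: 123456] c [ID: ] d [ID: 12a]"

# \\[ and \\] are literal brackets; \\d+ is one or more digits
pattern <- "\\[ID: \\d+\\]"

found <- unlist(regmatches(y, gregexpr(pattern, y)))
print(found)
# [1] "[ID: 7]"      "[ID: 123456]"
```

"[ID: ]" is skipped because `\\d+` needs at least one digit, and "[ID: 12a]" is skipped because the digits must be followed immediately by "]".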

score 0 · Answer 3 · answered Aug 23 '19 at 18:48

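# x is the input string defined in the question
inds = gregexpr("\\[ID: \\d+\\]", x)
# For each set of match positions, slice out the text from start to start + length - 1
lapply(inds, function(i){
    substring(x, i, i + attr(i, "match.length") - 1)
})
#[[1]]
#[1] "[ID: 1234]" "[ID: 5678]" "[ID: 9999]"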
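
One difference from the `regmatches()` approach that may be worth knowing: when nothing matches, `gregexpr()` reports position -1 with length -1, so the manual slicing gives an empty string rather than an empty vector. A rough sketch, where the input string and the names `manual` and `via_regmatches` are made up for illustration:

```r
# Hypothetical input with no "[ID: ...]" substrings at all
x <- "some text with no identifiers in it"
pattern <- "\\[ID: \\d+\\]"
inds <- gregexpr(pattern, x)

# Manual slicing: start is -1 and the computed end is -3,
# so substring() is asked for an empty range and returns ""
manual <- lapply(inds, function(i) {
    substring(x, i, i + attr(i, "match.length") - 1)
})

# regmatches() handles the no-match case and returns a zero-length vector
via_regmatches <- regmatches(x, inds)

print(manual[[1]])
print(via_regmatches[[1]])
```

If the result is going to be counted or looped over, the zero-length vector from `regmatches()` is probably the easier one to work with.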

d.b