I have a list of references, e.g.,

references <- c(
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, v. 3, no. 192, p. 71-199.",
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, no. 3.",
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, p. 71-199."
)

I want to extract the journal name from each reference, i.e. the text after the last colon and before the first ", v.", ", no." or ", p.".

I first tried (?<=:)(?.*)(?=(v\.)|(no\.)|(p\.)), but it returned 'from continental assembly to intracontinental deformation: Journal of Neverland, v. 3, no. 192, p.', which is not what I intended to extract.

I then tried

(?<=:)(?:[^:].*?)(?=(, v\.)|(, no\.)|(, p\.))

I expect 'Journal of Neverland', but it returns ' from continental assembly to intracontinental deformation: Journal of Neverland'.

Emma
Jiulin Guo

3 Answers

Here we match the text after the last colon, up to the next comma, in a capture group:

stringr::str_match(references, ": ((?!:)[^,:]*),")[,2]
# [1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"
MrFlick

You may use

:\s*\K[^:]*?(?=,\s*(?:v|no|p)\.)

See the regex demo

Details

  • : - a colon
  • \s* - 0+ whitespaces
  • \K - match reset operator
  • [^:]*? - zero or more chars other than : but as few as possible as *? is non-greedy
  • (?=,\s*(?:v|no|p)\.) - a positive lookahead that requires a comma, then 0+ whitespaces, and then v, no or p followed by a . immediately to the right of the current location.
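
To make the role of [^:]*? concrete, here is a minimal sketch on a shortened, made-up reference string (the string and the names x, loose and strict are illustrative assumptions, not taken from the question). It compares a lazy dot with the negated character class:

```r
# Shortened, made-up reference with two colons (an assumption for illustration).
x <- "Title part one: part two: Journal of Neverland, v. 3, no. 192, p. 71-199."

# Lazy dot: the match starts after the FIRST colon and may run across later colons.
loose <- regmatches(x, regexpr("(?<=:).*?(?=, (?:v|no|p)\\.)", x, perl = TRUE))

# [^:]*? cannot cross a colon, so the match has to start after the last colon
# that precedes the volume, number, or page marker.
strict <- regmatches(x, regexpr(":\\s*\\K[^:]*?(?=,\\s*(?:v|no|p)\\.)", x, perl = TRUE))

loose
# [1] " part two: Journal of Neverland"
strict
# [1] "Journal of Neverland"
```

Laziness only shortens the right end of a match. The left end is fixed by the first position where the pattern can start, which is why excluding the colon from the matched characters is what moves the start forward.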

In R:

regmatches(references, regexpr(":\\s*\\K[^:]*?(?=,\\s*(?:v|no|p)\\.)", references, perl=TRUE))

See R demo online:

references <- c(
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, v. 3, no. 192, p. 71-199.",
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, no. 3.",
  "Dumitru, T.A., Smith, D., Chang, E.Z., and Graham, S.A., 2001, Uplift, exhumation, and deformation in the Japanese Mt Everest, Paleozoic and Mesozoic tectonic evolution of central Africa: from continental assembly to intracontinental deformation: Journal of Neverland, p. 71-199."
)
regmatches(references, regexpr(":\\s*\\K[^:]*?(?=,\\s*(?:v|no|p)\\.)", references, perl=TRUE))
## => [1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"
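
Note that regmatches() with regexpr() drops elements that have no match, so the result can be shorter than the input. If the output must stay aligned with the references, something like the following sketch may help. The helper name extract_journal and the two sample strings are hypothetical, introduced only for illustration:

```r
# Hypothetical helper: keeps one result per input, with NA where nothing matches.
extract_journal <- function(refs) {
  pattern <- ":\\s*\\K[^:]*?(?=,\\s*(?:v|no|p)\\.)"
  m <- regexpr(pattern, refs, perl = TRUE)   # -1 marks elements without a match
  out <- rep(NA_character_, length(refs))
  out[m != -1] <- regmatches(refs, m)        # fill only the matched positions
  out
}

# Made-up inputs: one well-formed reference and one with no colon or marker.
res <- extract_journal(c(
  "Some title: Journal of Neverland, v. 3, p. 1-2.",
  "A reference with no colon or volume marker"
))
res
# [1] "Journal of Neverland" NA
```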

If you prefer a stringr based solution, use either

> str_extract(references, "(?<=:\\s)[^:]*?(?=,\\s*(?:v|no|p)\\.)")
[1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"

Or, if the whitespace after : can be 0 or many:

> str_match(references, ":\\s*([^:]*?)(?:,\\s*(?:v|no|p)\\.)")[,2]
[1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"
Wiktor Stribiżew
  • Getting to know more about regex: the '(?=(, v\\.)|(, [(\\d{0,})|(p\\.)])|(, no\\.))' can be written as '(?=,\\s*(?:v|no|p|\\d+p)\\.)'. To exclude the ':', use [^:]*? (non-greedy). Thanks for the suggestion. Really helpful! – Jiulin Guo May 24 '19 at 05:54
  • @JiulinGuo If my solution worked for you, please consider accepting/upvoting the answer. Let me know if anything is still unclear. – Wiktor Stribiżew May 24 '19 at 07:35
Here is a gsub solution

gsub('.*: (.*?), (?=v|no|p).*','\\1', references, perl=TRUE)
# [1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"

Alternatively, one can also use strsplit

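vapply(strsplit(references, ': *|, *', perl=TRUE),
       function(l) {
         # positions of the pieces that start a volume, number, or page field
         k <- which(startsWith(l, 'p. ') | startsWith(l, 'v. ') | startsWith(l, 'no. '))
         # the journal name is the piece just before the first of them
         l[k[1] - 1]
       }, character(1))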
# [1] "Journal of Neverland" "Journal of Neverland" "Journal of Neverland"
niko