Find the names contained in each sentence cycling through a large vector of names

Question

This question is an extension of this one: Find the names contained in each sentence (not the other way around)

I'll write the relevant part here. From this:

> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."                                                                          
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

We obtained this result:

library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"

#[[2]]
#[1] "Melanchthon"   "Martin Luther"

#[[3]]
#[1] "Paul"

#[[4]]
#[1] NA

#[[5]]
#[1] "Melanchthon"

But for a large toMatch vector, concatenating its values with the OR operator might not be very efficient. So my question is: how can the same result be obtained using a function or a loop? That way, a regular expression such as \< or \b could perhaps be placed around the toMatch values, so that only whole words are matched rather than arbitrary substrings.

I've tried this but don't know how to save the matches in lst to get the same result as above.

for(i in 1:length(sentences)){
    for(j in 1:length(toMatch)){
        lst<-str_extract_all(sentences[i], toMatch[j])
        }}

try this `lst[[i]][j] <- str_extract_all(sentences[i], toMatch[j])` — Prem, Jul 05 '17 at 09:00
Try a nested `lapply` function. Not sure how well that compares to the `or` version, but it's certainly better than a nested loop: `lapply(sentences,function(x) unlist(lapply(toMatch,function(y) str_extract_all(x,y))))` — Val, Jul 05 '17 at 09:06
If you want to check each name individually, you have the upside of using a fixed regex pattern which is c.p. faster than a non-fixed pattern. But of course you have the downside of looping and running the regex multiple times. One option would be `lapply(toMatch, function(m) stringi::stri_extract_all_fixed(sentences, m, simplify = TRUE))` — talat, Jul 05 '17 at 09:30
@docendodiscimus maybe I should rephrase: it's certainly better than _this_ nested loop, meaning the one in the post. — Val, Jul 05 '17 at 09:32
@Prem it gives me an error `Error in `*tmp*`[[i]] : subindex out of limits` @Val the nested lapply works good but when trying with other (larger) data gives a weird error `Error in stri_extract_all_regex(string, pattern, simplify = simplify, : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)` and, secondly, I'm not able to use regexp with `str_extract_all(x,y)`, this `str_extract_all(x,"\\")` gives me a syntax error. — Hoju, Jul 05 '17 at 09:50
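
A note on the regex error in the last comment: `U_REGEX_MISMATCHED_PAREN` is what one would expect if some entry of `toMatch` contains an unbalanced parenthesis, because each name is interpreted as a regular expression. That is only a guess, since the larger data is not shown. Below is a minimal sketch of one way around it, assuming the ICU regex engine that stringr uses: quote each name with `\\Q...\\E` so it is taken literally, and put `\\b` on both sides so only whole words match. The helper name `whole_word` and the example strings are made up for illustration.

```r
library(stringr)

# Hypothetical helper: match `name` literally (\Q...\E) and only as a
# whole word (\b at both ends).
whole_word <- function(name) paste0("\\b\\Q", name, "\\E\\b")

# "Paul" is found once; the "Paul" inside "Pauline" is skipped.
paul_hits <- str_extract_all("He studied Paul, and Pauline doctrine",
                             whole_word("Paul"))[[1]]

# A name with an unbalanced parenthesis no longer breaks the pattern.
paren_hits <- str_extract_all("the disputation of Leipzig (1519) as a spectator",
                              whole_word("Leipzig (1519"))[[1]]
```

The `\\b` anchors only behave as intended when the name starts and ends with a word character, and a name that itself contains `\E` would end the quoting early.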

score 1 · Answer 1 · answered Jul 05 '17 at 17:50

Are you expecting something like this?

library(stringr)

sentences <- c(
"Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
" Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
" He studied the Scripture, especially of Paul, and Evangelical doctrine",
" He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",                                          
" Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

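# Pre-allocate one list element per sentence
lst <- vector("list", length(sentences))

for (i in seq_along(sentences)) {
  # One slot per name; NA marks "not found in this sentence"
  lst[[i]] <- rep(NA_character_, length(toMatch))
  for (j in seq_along(toMatch)) {
    tmp <- str_extract_all(sentences[i], toMatch[j])
    if (length(tmp[[1]]) > 0) {
      # Keep the first match only, so the assignment has length one
      lst[[i]][j] <- tmp[[1]][1]
    }
  }
}
# Drop the NA placeholders and store the result
lst <- lapply(lst, function(x) x[!is.na(x)])
lst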
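
The inner loop over sentences can also be left to stringr, which is vectorised over its first argument. The following is a minimal sketch rather than a benchmarked solution: it loops over the names only and uses `fixed()` so that each name is treated as a literal string, along the lines talat suggested in the comments. The shortened sentences and the variable name `found` are assumptions made for the example.

```r
library(stringr)

sentences <- c(
  "He accepted a call to Wittenberg by Martin Luther",
  " Melanchthon became professor with the help of Martin Luther",
  " He studied the Scripture, especially of Paul",
  " He was present at the disputation of Leipzig (1519)"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# One list element per sentence, initially empty (NULL)
found <- vector("list", length(sentences))

for (nm in toMatch) {
  # Indices of all sentences containing this name (literal match)
  idx <- which(str_detect(sentences, fixed(nm)))
  # Append the name to each of those sentences' results
  found[idx] <- lapply(found[idx], c, nm)
}

# Mark sentences without any match as NA, as in the question
found[lengths(found) == 0] <- NA
found
```

Compared with the `str_extract_all` result in the question, each name appears at most once per sentence and in the order of `toMatch`, not in the order of occurrence. Literal matching also finds substrings, so "Paul" would be reported for a sentence that only contains "Pauline".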
