Find the names contained in each sentence (not the other way around)

Question

My question is an extension of this one: How to extract sentences containing specific person names using R

I'll write the relevant part here (slightly edited for the sake of this question):

> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."                                                                          
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

The answer provided gives the sentences that match each name:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[3] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

lapply(toMatch, foo) applies the function foo to each element of toMatch. foo searches the sentences with grep, which returns the positions of the sentences that contain the name, and uses those positions to subset the vector: sentences[grep(Match, sentences)].

My question is: instead of returning every sentence that matches each element of toMatch, how could we go through every sentence and find the names that match it? That is, the other way around (I know it's a bit confusing). The output would be this:

[1] "Martin Luther"
[2] "Melanchthon","Martin Luther"                                                                    
[3] "Paul"
[4] NA                   # Or maybe this row doesn't exist; either is fine for me
[5] "Melanchthon"

Could this be done by altering the answer already provided, or would it be easier to use a different function with lapply(sentences, FUNCTION)?

akrun · Accepted Answer · 2017-07-05T04:08:25.597


One option would be str_extract_all from stringr:

library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"

#[[2]]
#[1] "Melanchthon"   "Martin Luther"

#[[3]]
#[1] "Paul"

#[[4]]
#[1] NA

#[[5]]
#[1] "Melanchthon"

Or we can use regmatches/gregexpr from base R

lst <- regmatches(sentences, gregexpr(paste(toMatch, collapse="|"), sentences))

and replace the list elements of length 0 with NA (as before).
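For completeness, here is a minimal sketch of the base R variant with the NA replacement applied. The sentences are shortened versions of the ones in the question (an assumption made for brevity), and NA_character_ is used instead of plain NA so that every list element stays a character vector.

```r
# Shortened versions of the question's sentences (assumption: trimmed for brevity)
sentences <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther",
  " Melanchthon became professor of Greek with the help of Martin Luther",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " He was present at the disputation of Leipzig (1519) as a spectator",
  " Johann Eck having attacked his views, Melanchthon replied"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# One alternation pattern: "Martin Luther|Paul|Melanchthon"
pattern <- paste(toMatch, collapse = "|")

# gregexpr finds every match per sentence; regmatches extracts the matched text
lst <- regmatches(sentences, gregexpr(pattern, sentences))

# Sentences without any match come back as character(0); mark them as NA
lst[lengths(lst) == 0] <- NA_character_

print(lst)
```

Names are returned in the order in which they occur in each sentence, and a name that occurs several times in one sentence is returned several times.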

edited Jul 05 '17 at 04:08

answered Jul 05 '17 at 04:03

akrun


That's perfect, thanks. One thing: if `"Paul"` were present 4 times in sentences[3], the output of your code would be `"Paul", "Paul", "Paul", "Paul"`. Would it be possible to get each name just once per sentence? – Hoju Jul 05 '17 at 04:54

@Hoju Currently, it will get all the `Paul`s, but if you need only `unique`, then `lapply(lst, unique)` would do it – akrun Jul 05 '17 at 04:57
One thought, in the case that the `toMatch` names vector is very big, do you think that concatenating the names with a lot of OR operators is less efficient than using the function approach `foo<-function(Match){sentences[grep(Match,sentences)]}`? Can you think of any way to solve the question using a function like that? I think maybe it'd be faster because it could be used something like `grep("\\",sentences)` so it only looks for words instead of strings. – Hoju Jul 05 '17 at 07:56
@Hoju In that case, you may need to loop through each element and execute it separately – akrun Jul 05 '17 at 08:01
The problem is to cycle through the names with `foo<-function(Match){str_extract_all(sentences, Match)}` you give the function the `toMatch` names vector as an argument (plus, in this case, it'd return an 3x5 matrix). Maybe with a for? `for(i in 1:length(sentences)){ }` – Hoju Jul 05 '17 at 08:28
`for(i in 1:lengths(sentences)){ / for(j in 1:lengths(toMatch)){ / lst<-str_extract_all(sentences[i], toMatch[j]) / }}` this only uses the first element dont know why – Hoju Jul 05 '17 at 08:33
change lengths for length, every cycle overwrite the value of `lst` – Hoju Jul 05 '17 at 08:39
@Hoju Can u post as a new question. The `lengths` is for finding the `length` of each `list` element – akrun Jul 05 '17 at 08:41
changed `lst` for `lst[i;j]` but gives me syntax error – Hoju Jul 05 '17 at 08:42
sure, I'll post a new question :) – Hoju Jul 05 '17 at 08:42
https://stackoverflow.com/questions/44921411/find-the-names-contained-in-each-sentence-cycling-through-a-large-vector-of-name – Hoju Jul 05 '17 at 08:54
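The per-name loop discussed in the comments could look roughly like the sketch below. It is only one possible reading of that suggestion, not the approach from the follow-up question: each name is tested separately with grepl, and the results are collected per sentence. The shortened sentences and the names `hits` and `res` are assumptions introduced for illustration.

```r
# Shortened versions of the question's sentences (assumption: trimmed for brevity)
sentences <- c(
  "He accepted a call to the University of Wittenberg by Martin Luther",
  " Melanchthon became professor of Greek with the help of Martin Luther",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " He was present at the disputation of Leipzig (1519) as a spectator",
  " Johann Eck having attacked his views, Melanchthon replied"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# One logical vector per name: does each sentence contain that name?
# fixed = TRUE treats the name as a literal string rather than a regex.
hits <- do.call(cbind, lapply(toMatch, function(name) {
  grepl(name, sentences, fixed = TRUE)
}))
# hits is a matrix with one row per sentence and one column per name

# For each sentence, keep the names whose column is TRUE (NA if none)
res <- lapply(seq_along(sentences), function(i) {
  found <- toMatch[hits[i, ]]
  if (length(found) == 0) NA_character_ else found
})

print(res)
```

Because grepl only reports whether a name is present, each name appears at most once per sentence, which also covers the `unique` request above. The names come out in the order of `toMatch`, not in the order in which they occur in the sentence. Whether this is faster than a single alternation pattern for a very large `toMatch` is not established here and would need to be benchmarked.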
