How to extract sentences containing specific person names using R

Question

I am using R to extract sentences containing specific person names from texts and here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

In this short paragraph there are several person names, such as Johann Reuchlin, Melanchthon, and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul, and Melanchthon) are correctly extracted and recognized. I have two questions:

How could I extract sentences containing these names?
As the output of the named entity recognizer is not very promising, suppose I wrap each name in "[[ ]]", such as [[Johann Reuchlin]], [[Melanchthon]]. How could I extract the sentences containing these bracketed names [[A]], [[B]] ...?

Andrew Taylor · Accepted Answer · 2015-07-21T14:18:42.680

You can do this with `strsplit` and `grep`. First I made an object `para` containing your paragraph.

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Or a little cleaner:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
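
Splitting on "\\." drops the full stops and leaves a leading space on each sentence, and a bare name such as "Paul" would also match inside a longer word. A minimal sketch of one way around both, assuming sentences end with a full stop followed by whitespace (the sample text is shortened from the question, and the name `pattern` is just for illustration):

```r
# Sample text, shortened from the paragraph in the question.
para <- paste(
  "He accepted a call to the University of Wittenberg by Martin Luther.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator."
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split on whitespace that follows a full stop; the lookbehind keeps the
# full stop attached to its sentence (lookbehind needs perl = TRUE).
sentences <- unlist(strsplit(para, "(?<=\\.)\\s+", perl = TRUE))

# \\b word boundaries stop "Paul" from matching inside a longer word.
pattern <- paste0("\\b(", paste(toMatch, collapse = "|"), ")\\b")
matched <- sentences[grepl(pattern, sentences, perl = TRUE)]
print(matched)
```

This still breaks on abbreviations such as "Dr." or "e.g.", so treat it as a rough splitter rather than a real sentence tokenizer.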

If you want the sentences for each person returned separately, then:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To add each person's name, do something simple such as:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
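
An alternative to prepending the name is to return a named list, so each element is labelled with the person it belongs to. A small sketch under the assumption that the names should be matched literally (the shortened `sentences` vector and the name `byName` are made up for the example):

```r
# Shortened sentences, standing in for the split paragraph.
sentences <- c(
  "He accepted a call to Wittenberg by Martin Luther.",
  "Melanchthon became professor of Greek.",
  "He studied the Scripture, especially of Paul.",
  "Johann Eck having attacked his views, Melanchthon replied."
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon", "Johann Eck")

# fixed = TRUE treats each name as a literal string rather than a regex.
# With simplify = FALSE, sapply() returns a list and, because the input is a
# character vector, labels each element with the corresponding name.
byName <- sapply(
  toMatch,
  function(nm) sentences[grepl(nm, sentences, fixed = TRUE)],
  simplify = FALSE
)
print(byName)
```

A sentence that mentions two people shows up under both names, and `byName[["Melanchthon"]]` pulls out one person's sentences directly.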

EDIT 4:

And if you want sentences that contain several people/places/things (words) at once, add a lookahead pattern for that combination to `toMatch`, such as:

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

and set `perl = TRUE`, since lookaheads need Perl-compatible regular expressions:

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
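
Each `(?=.*word)` is a lookahead that succeeds only if `word` occurs somewhere in the sentence, so chaining them requires all the words to be present, in any order. A hedged sketch of building such a pattern from a vector of terms (the helper `allOf` is hypothetical, and it assumes the terms contain no regex metacharacters):

```r
# Hypothetical helper: one lookahead per term, concatenated into one pattern.
allOf <- function(terms) paste0("(?=.*", terms, ")", collapse = "")

sentences <- c(
  " Melanchthon became professor of the Greek language",
  " He studied the Scripture, especially of Paul",
  " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)

pattern <- allOf(c("Melanchthon", "Scripture"))
print(pattern)

# TRUE only where every term occurs in the sentence.
hasAll <- grepl(pattern, sentences, perl = TRUE)
print(sentences[hasAll])

# The same test without lookaheads: AND together one literal match per term.
hasAll2 <- Reduce(`&`, lapply(c("Melanchthon", "Scripture"), grepl,
                              x = sentences, fixed = TRUE))
```

The second form avoids regex escaping issues entirely, at the cost of one pass over the sentences per term.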

EDIT 5: Answering your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

This will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
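
To go from there to the sentences themselves, the same bracket pattern can be used with `grepl`. A minimal sketch, assuming the text has already been split into sentences and the names are wrapped as `[[...]]` (the `tagged` vector and the names `namesPerSentence` and `lookup` are invented for the example):

```r
# Already-split sentences; the second one has no tagged name.
tagged <- c(
  "He accepted a call by [[Martin Luther]], recommended by [[Johann Reuchlin]].",
  "He was present at the disputation of Leipzig (1519) as a spectator.",
  "[[Johann Eck]] having attacked his views, [[Melanchthon]] replied."
)
tagPattern <- "\\[\\[.*?\\]\\]"

# Keep only the sentences that contain at least one [[...]] expression.
hasTag <- grepl(tagPattern, tagged, perl = TRUE)
withTags <- tagged[hasTag]

# For each kept sentence, pull out the tagged names and strip the brackets.
namesPerSentence <- lapply(
  regmatches(withTags, gregexpr(tagPattern, withTags, perl = TRUE)),
  function(m) gsub("\\[\\[|\\]\\]", "", m)
)

# Repeat each sentence once per name it contains, then group by name.
lookup <- split(rep(withTags, lengths(namesPerSentence)),
                unlist(namesPerSentence))
print(lookup)
```

`lookup[["Melanchthon"]]` then returns every sentence in which that tagged name appears.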

Many thx, but I notice that for the first and the 4th sentences, there are two person names respectively. If I add the name such as "Johann Eck" or "Johann Reuchlin" to the "toMatch" and run the code above, I still get four sentence output. My new question is how could I get each person's sentence respectively (overlapped) ? — Frown, Jul 21 '15 at 11:12
I don't quite understand. Are you asking for a) only sentences that have all the individuals' names in them, or b) a separate return for each individual name (those sentences that have Martin Luther in them, then all sentences that have Paul in them, etc)? — Andrew Taylor, Jul 21 '15 at 11:24
Thx, it works!!! Sorry for the ambiguous question. My meaning is the latter one: a separate return for each individual name (those sentences that have Martin Luther in them, then all sentences that have paul in them, etc). Besides.... is there any way to add different person names between different sentences containing them separately, such as '[[2]] **Paul** [1] " He studied the Scripture, especially of Paul, and Evangelical doctrine" — Frown, Jul 21 '15 at 12:55
My gratitude to you is beyond expression :) One last question, which corresponds to the second point in my question: how could I extract sentences containing sth like "[[person A]]", "[[person B]]"... — Frown, Jul 21 '15 at 13:12
Do you mean: Extract a sentence/sentences that contain both person A and person B? — Andrew Taylor, Jul 21 '15 at 13:19
No, it is just a sentence containing regex expressions such as [[...]], and inside the double brackets are person names... — Frown, Jul 21 '15 at 13:30
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/83877/discussion-between-andrew-taylor-and-hui). — Andrew Taylor, Jul 21 '15 at 13:36

score 3 · Answer 2 · answered Jul 22 '15 at 02:25

Here's a considerably simpler method using two packages, quanteda and stringi (`txt` holds the paragraph from the question):

sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))

sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."

sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
## 
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
## 
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."
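
Note that the `split()` step above relies on every sentence yielding exactly one match, and as far as I know `tokenize()` is no longer exported by recent quanteda releases. A rough sketch of a variant that uses stringi alone and tolerates sentences with several names or none (the sample `txt` is made up, and this is only one possible way to do it):

```r
library(stringi)

# Made-up sample text; the last sentence mentions two of the names.
txt <- paste(
  "He accepted a call to Wittenberg by Martin Luther.",
  "Melanchthon became professor of Greek.",
  "He was present at the disputation of Leipzig as a spectator.",
  "Melanchthon later cited Paul in his reply."
)

# ICU sentence boundaries; trim the whitespace left at the split points.
sents <- stri_trim_both(stri_split_boundaries(txt, type = "sentence")[[1]])

namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")

# One literal search per name, so a sentence can appear under several names
# and sentences without any name are simply left out.
sentList <- lapply(
  setNames(namesToExtract, namesToExtract),
  function(nm) sents[stri_detect_fixed(sents, nm)]
)
print(sentList)
```

Here the last sentence is listed under both `Paul` and `Melanchthon`, which the single-match approach would not handle.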

Many thx. I haven't used these two packages before, but it seems very convenient in this case :) — Frown, Jul 26 '15 at 14:09
