Keep only sentences in corpus that contain specific key words (in R)

Question

I have a corpus with .txt documents. From these .txt documents, I do not need all sentences, but I only want to keep certain sentences that contain specific key words. From there on, I will perform similarity measures etc.

So, here is an example. From the data_corpus_inaugural data set of the quanteda package, I only want to keep the sentences in my corpus that contain the words "future" and/or "children".

I load my packages and create the corpus:

library(quanteda)
library(stringr)


## corpus with data_corpus_inaugural of the quanteda package
corpus <- corpus(data_corpus_inaugural)
summary(corpus)

Then I want to keep only those sentences that contain my key words:

## keep only those sentences of a document that contain the words future and/or children

First, let's see which documents contain these key words:

## extract all matches of future or children
str_extract_all(corpus, pattern = "future|children")

So far, I only found out how to exclude the sentences that contain my key words, which is the opposite of what I want to do.

## exclude sentences that contain future or children or both (?)
corpustrim <- corpus_trimsentences(corpus, exclude_pattern = "future|children")
summary(corpustrim)

The above command excludes sentences containing my key words. My idea here with the corpus_trimsentences function is to exclude all sentences BUT those containing "future" and/or "children".

I tried to invert the pattern with a regular expression, but I could not get it to return what I want.

I looked into the corpus_reshape and corpus_subset functions of the quanteda package but I can't figure out how to use them for my purpose.

could you reverse corpus_trimsentences to have all patterns except future and children? — Adam Warner, Jun 13 '18 at 16:14
Dear Adam, thank you for your answer:) Yes that was also my idea when I found the corpus_trimsentences function. I assume that that should work as well, it would be logical. However, I did not manage to do it (with my very very limited regex knowledge). Ken's solution is more straight forward :) But corpus_trimsentences function is one to keep in mind! — vewees, Jun 15 '18 at 06:50

score 3 · Accepted Answer · answered Jun 13 '18 at 18:26

You are correct that it's corpus_reshape() and corpus_subset() that you want here. Here's how to use them.

First, reshape the corpus to sentences.

library("quanteda")

data_corpus_inauguralsents <- 
  corpus_reshape(data_corpus_inaugural, to = "sentences")
data_corpus_inauguralsents

Then use stringr to create a logical (Boolean) vector that indicates the presence or absence of the pattern, equal in length to the new sentence corpus.

containstarget <- 
  stringr::str_detect(texts(data_corpus_inauguralsents), "future|children")
summary(containstarget)
##    Mode   FALSE    TRUE 
## logical    4879     137
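One caveat with the pattern above: str_detect() is case-sensitive by default and matches substrings, so "future|children" misses "Future" at the start of a sentence and also hits words such as "futures". Here is a minimal sketch of a stricter match. It uses base R grepl() rather than stringr, and the example sentences are made up for illustration; both are my assumptions, not part of the original answer.

```r
# Toy sentences, invented for illustration only
sents <- c("The Future is ours.",
           "Our children matter.",
           "Nothing relevant here.",
           "Futures markets rallied.")

# \\b marks a word boundary, so "futures" is not treated as "future";
# ignore.case = TRUE also catches the capitalised "Future"
pattern <- "\\b(future|children)\\b"
contains_target <- grepl(pattern, sents, ignore.case = TRUE, perl = TRUE)

# Sentences that would be kept
kept <- sents[contains_target]
print(kept)
```

The resulting logical vector has one element per sentence, so it can be used for subsetting in the same way as containstarget.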

Then use corpus_subset() to keep only those with the pattern:

data_corpus_inauguralsentssub <- 
  corpus_subset(data_corpus_inauguralsents, containstarget)
tail(texts(data_corpus_inauguralsentssub), 2)
## 2017-Trump.30 
## "But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential." 
## 2017-Trump.41 
## "And now we are looking only to the future."

Finally, if you want to put these selected sentences back into their original document containers, but without the sentences that did not contain the target words, then reshape again:

# reshape back to documents that contain only sentences with the target terms
corpus_reshape(data_corpus_inauguralsentssub, to = "documents")
## Corpus consisting of 49 documents and 3 docvars.
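The three steps can be wrapped into one helper if you need to repeat them for several key word sets. This is only a sketch: keep_sentences_matching() is a hypothetical name of mine, not a quanteda function. It uses as.character() instead of texts(), because as far as I know texts() is deprecated in quanteda 3 and later, and it subsets with `[` rather than corpus_subset() to sidestep non-standard evaluation inside a function. The toy corpus is invented for illustration.

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): keep only the sentences
# of each document that match a regular expression.
keep_sentences_matching <- function(x, pattern) {
  # 1. split every document into one "document" per sentence
  sents <- corpus_reshape(x, to = "sentences")
  # 2. flag the sentences that contain the pattern
  hits <- grepl(pattern, as.character(sents), perl = TRUE)
  # 3. keep the flagged sentences and regroup them by original document
  corpus_reshape(sents[hits], to = "documents")
}

# Toy corpus, made up for this example
toy <- corpus(c(
  d1 = "We think of the future. The weather is nice.",
  d2 = "Nothing relevant here.",
  d3 = "Our children play. Our children learn. It rained."
))

res <- keep_sentences_matching(toy, "future|children")
print(docnames(res))
```

Documents without any matching sentence drop out of the result entirely, which is why the reshaped inaugural corpus above has fewer documents than the original.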

score 1 · Answer 2 · answered Jun 13 '18 at 16:19

You need to use the tokens function.

library(quanteda)

corpus <- corpus(data_corpus_inaugural)

# tokens to keep
tok_to_keep <- tokens_select(
  tokens(corpus, what = "sentence"),
  pattern = "future|children",
  valuetype = "regex",
  selection = "keep"
)

This returns a list of all the speeches and sentences where the key words are present. Next you can unlist the list of tok_to_keep or do whatever you need to it to get what you want.
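A short sketch of that last step, assuming a made-up two-document corpus: as.list() turns the tokens object into a plain list with one character vector of sentences per document, and unlist() flattens it into a single character vector. The object names here are mine.

```r
library(quanteda)

# Toy corpus, invented for illustration only
toy <- corpus(c(
  d1 = "We think of the future. The weather is nice.",
  d2 = "Nothing relevant here."
))

# Each "token" is a whole sentence; keep those matching the regex
kept <- tokens_select(
  tokens(toy, what = "sentence"),
  pattern = "future|children",
  valuetype = "regex",
  selection = "keep"
)

# One character vector of kept sentences per document
sentences_by_doc <- as.list(kept)

# All kept sentences as a single character vector
all_sentences <- unlist(sentences_by_doc, use.names = FALSE)
print(all_sentences)
```

Note that a document with no matching sentence stays in the tokens object as an empty element, unlike in the corpus_reshape() approach.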
