I have a corpus of .txt documents. I do not need every sentence from these documents; I only want to keep the sentences that contain specific key words. After that, I will compute similarity measures etc.
So, here is an example. From the data_corpus_inaugural data set of the quanteda package, I only want to keep the sentences in my corpus that contain the words "future" and/or "children".
I load my packages and create the corpus:
library(quanteda)
library(stringr)
## corpus with data_corpus_inaugural of the quanteda package
corpus <- corpus(data_corpus_inaugural)
summary(corpus)
Then I want to keep only those sentences that contain my key words:
## keep only those sentences of a document that contain the words future and/or children
First, let's see which documents contain these key words
## extract all matches of future or children
str_extract_all(corpus, pattern = "future|children")
So far, I only found out how to exclude the sentences that contain my key words, which is the opposite of what I want to do.
## exclude sentences that contain future or children or both (?)
corpustrim <- corpus_trimsentences(corpus, exclude_pattern = "future|children")
summary(corpustrim)
The above command excludes sentences containing my key words. My idea here with the corpus_trimsentences function is to exclude all sentences BUT those containing "future" and/or "children".
I tried using a regular expression to invert the match, but I could not get it to work: it does not return what I want.
I looked into the corpus_reshape and corpus_subset functions of the quanteda package, but I can't figure out how to use them for my purpose.
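For reference, here is a minimal sketch of how those two functions might be combined. It is not a confirmed solution and assumes a recent quanteda release (version 3 or later, where as.character() returns the texts of a corpus). The names corp_sent, keep, corp_keep and corp_docs are my own placeholders. The idea is to split the corpus into one sentence per document, keep the sentences that match, and then merge them back into their original documents:

```r
library(quanteda)

# Reshape so that every sentence becomes its own "document"
corp_sent <- corpus_reshape(data_corpus_inaugural, to = "sentences")

# Logical vector: TRUE where a sentence mentions "future" or "children"
# (plain substring match, so "futures" or "grandchildren" also count)
keep <- grepl("future|children", as.character(corp_sent), ignore.case = TRUE)

# Keep only the matching sentences
corp_keep <- corpus_subset(corp_sent, keep)

# Assumption: reshaping back to documents groups the retained sentences
# by the document they originally came from
corp_docs <- corpus_reshape(corp_keep, to = "documents")
summary(corp_docs)
```

Documents in which no sentence matches would presumably be missing from corp_docs, which matters when comparing the result with the original corpus.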