I want to retain only pattern words (i.e. the gene names I have specified) from each document of my corpus in order to generate the DTM. I do not want to pre-process the documents before creating the corpus; I only want to select and retain the gene names from the corpus. I used a custom function to keep only the terms in "pattern" and remove everything else (taken from "How to select only a subset of corpus terms for TermDocumentMatrix creation in tm"). Here is my code:
library(tm)
library(Rstem)
library(RTextTools)
docs <- Corpus(DirSource(path of the directory))
# Custom function to keep only the terms in "pattern" and remove everything else
f <- content_transformer(function(x, pattern) regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)))
# The pattern I want to search for
gene = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, gene)[[1]]
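For comparison, here is a minimal sketch of the same regmatches/gregexpr extraction on a plain character string, outside of tm. The sample sentence and the names sample_text, gene_pattern and hits are made up for illustration and are not from my corpus; only base R is used.

```r
# Hypothetical sample sentence, not taken from the corpus
sample_text <- "IL6 and TNF levels rose while OLR1 was unchanged"

# Same alternation pattern as in the question
gene_pattern <- "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"

# gregexpr() finds every match position; regmatches() extracts the matched text
hits <- regmatches(sample_text,
                   gregexpr(gene_pattern, sample_text,
                            perl = TRUE, ignore.case = TRUE))

# hits is a list with one element per input string
print(class(hits))   # "list"

# The matched gene names for the first (only) string are a character vector
print(hits[[1]])     # "IL6" "TNF" "OLR1"
```

So on a plain string the extraction returns a list holding one character vector of matches, which may be relevant to what goes wrong inside tm_map.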
However, running the tm_map call above gives the error:
Error in UseMethod("content", x) :
  no applicable method for 'content' applied to an object of class "character"