I want to retain only the pattern words (i.e. the gene names I have specified) from each document of my corpus in order to generate the DTM. I do not want to pre-process the documents before creating the corpus; I want to select and retain the gene names from the corpus itself. I have used a custom function that keeps only the terms matching "pattern" and removes everything else (taken from "How to select only a subset of corpus terms for TermDocumentMatrix creation in tm"). Here is my code.

    library(tm)
    library(Rstem)
    library(RTextTools)

    # Build the corpus from the raw files in a directory
    docs <- Corpus(DirSource("path of the directory"))

    # Custom function to keep only the terms in "pattern" and remove everything else
    f <- content_transformer(function(x, pattern)
      regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)))

    # The pattern I want to search for
    gene <- "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"

    # Apply the transformation and take the first document
    docs <- tm_map(docs, f, gene)[[1]]

However, I get the following error:

    Error in UseMethod("content", x) :
      no applicable method for 'content' applied to an object of class "character"
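
To make the goal concrete, here is a minimal sketch of the extraction step on plain character vectors, using base R only and no tm objects. It does not reproduce or explain the error above. The helper name `keep_genes`, the sample sentences, and the word-boundary anchors (`\\b`) are assumptions added for illustration; the boundaries are there because, with a plain alternation such as `IL1|...|IL10`, the `IL1` branch is tried first and would match inside `IL10`.

```r
# Gene names from the question, kept as a vector so the pattern can be built from it
genes <- c("IL1", "IL2", "IL3", "IL4", "IL5", "IL6", "IL7", "IL8", "IL9",
           "IL10", "TNF", "TGF", "AP2", "OLR1", "OLR2")

# Assumption: gene names appear as whole tokens, so wrap the alternation in \b ... \b
pattern <- paste0("\\b(?:", paste(genes, collapse = "|"), ")\\b")

# Hypothetical helper: returns one string per input text containing only the matched gene names
keep_genes <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  # Collapse each list element back into a single space-separated string
  vapply(hits, paste, character(1), collapse = " ")
}

# Made-up sample texts standing in for the documents
texts <- c("Levels of IL10 and tnf rose, while IL1 was unchanged.",
           "No gene names here.")

result <- keep_genes(texts, pattern)
print(result)
```

Collapsing the matches back into one string per text keeps the result a character vector of the same length as the input, which is the shape I would expect a document's content to have after the transformation.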

Sushri