I'm having trouble using a RegEx on a corpus.
I read in a couple of text documents that I converted to a corpus. I want to display it in a TermDocumentMatrix after some pre-processing.
First I want to filter the words with the RegEx "(\b([a-z]*)\B)", which matches each word without its last character. For example, "the host" -> "th", "hos".
Then I want to build character n-grams with n = 1:3 from those matches, so the previous example becomes "t", "th", "h", "ho", "hos". In other words, I want every sequence of up to three characters that starts at the beginning of a word and does not include the word's last character.
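To make the expected output concrete, here is a minimal base R sketch of the transformation on a single string. It is only an illustration of what I'm after, not the quanteda solution I'm looking for: the variable names are made up, I dropped the capture groups and used "+" instead of "*" to avoid empty matches, and I'm assuming perl = TRUE so that \b and \B behave as in the RegEx above.

```r
# Illustration only: base R, one hard-coded example string
text <- "the host"

# Match each lowercase word without its last character
stems <- regmatches(text, gregexpr("\\b[a-z]+\\B", text, perl = TRUE))[[1]]

# For each match, keep the prefixes of length 1 to 3 (or fewer if the match is shorter)
prefixes <- unlist(lapply(stems, function(w) {
  substring(w, 1, seq_len(min(3, nchar(w))))
}))

print(prefixes)
```

For "the host" this should give "t", "th", "h", "ho", "hos", which is the set of terms I would like to end up with in the TermDocumentMatrix.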
My code so far gives me a TermDocumentMatrix with n = 1:3 over the whole corpus. However, none of my attempts to add the RegEx have worked.
I was wondering if there's a way to include it in the call: typedPrefix <- tokens()...
Here's the code:
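# load packages: readtext for readtext(), quanteda for corpus()/tokens()/dfm(), tm for as.TermDocumentMatrix()/weightTf
library(readtext)
library(quanteda)
library(tm)
# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)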
# start processing: replace whitespace with "_", then build character n-grams (n = 1 to 3)
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), what = "character", ngrams = 1:3, concatenator = "", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
as.matrix(tdm2)
#write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")