I am on a project with R and I am starting to get my hands dirty with it.
In the first part I try to clean the data of vector msg. But later when I build the termdocumentmatrix
, these characters still appear.
I would like to remove words with less than 4 letters and remove punctuation
gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg)
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)