I am working on a project in R and am just starting to get my hands dirty with it.

In the first part I try to clean the text in the vector msg, but when I later build the TermDocumentMatrix, the characters I tried to remove still appear. I would like to remove words with fewer than 4 letters and remove punctuation:

gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg) 
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)
Psidom
Claudio

1 Answer

gsub() returns a modified copy and leaves its input unchanged, and you haven't stored the results of your first two lines in a variable. So in your third line, where you create the corpus, you are still using the unmodified msg data. Give this a try:

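library(tm)

# gsub() returns a new vector, so store each result and pass it on.
# Note: {1,4} also drops four-letter words; use {1,3} to drop only words with fewer than 4 letters.
msg_clean <- gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
msg_clean <- gsub("[[:punct:]]", "", msg_clean)

# Build the corpus and the term-document matrix from the cleaned text
corpus <- Corpus(VectorSource(msg_clean))
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)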
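
To see why the assignment matters, here is a minimal base-R sketch on a small made-up vector (the sample strings and the names `msg`, `no_punct` and `msg_clean` are assumptions standing in for `pclbyshares$msg`). It also swaps in a slightly different pattern from the one above: `{1,3}` without the trailing space, since `{1,4}` followed by a space also removes four-letter words and skips a short word at the very end of a string.

```r
# Hypothetical sample data standing in for pclbyshares$msg
msg <- c("Hey, the big sale ends today!", "Free gift with every order...")

# Calling gsub() without assigning only prints the result; msg is unchanged
gsub("[[:punct:]]", "", msg)
print(msg)

# Assign each step so the next one works on the cleaned text
no_punct <- gsub("[[:punct:]]", "", msg)

# Drop whole words of 1 to 3 letters (perl = TRUE uses the PCRE engine)
msg_clean <- gsub("\\b\\w{1,3}\\b", "", no_punct, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
msg_clean <- trimws(gsub("\\s+", " ", msg_clean))
print(msg_clean)
```

If you would rather let tm do the cleaning, `TermDocumentMatrix()` also takes a `control` list; something like `control = list(removePunctuation = TRUE, wordLengths = c(4, Inf))` should cover both requirements, though I have not run it against your data.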
Matt Sandgren