I have two corpuses that contain similar words. similar enough that using setdiff
doesn't really help my cause. So I've turned towards finding a way to extract a list or corpus (to eventually make a wordcloud) of words that are more frequent (assuming something like this would have a threshold - so maybe like 50% more frequent?) in corpus #1, compared to corpus #2.
This is everything I have right now:
> install.packages("tm")
> install.packages("SnowballC")
> install.packages("wordcloud")
> install.packages("RColorBrewer")
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(RColorBrewer)
> UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
> corpus = Corpus(VectorSource(UKDraft$Report))
> corpus = tm_map(corpus, tolower)
> corpus = tm_map(corpus, PlainTextDocument)
> corpus = tm_map(corpus, removePunctuation)
> corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes", "kentucky", "wildcats", stopwords("english")))
> frequencies = DocumentTermMatrix(corpus)
> allReports = as.data.frame(as.matrix(frequencies))
> SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))
So if the word "wingspan" has a 100 count frequency in corpus#2 ('SECcorpus') but 150 count frequency in corpus#1 ('corpus'), we would want that word in our resulting corpus/list.