
I have two corpora that contain similar words. They are similar enough that using setdiff() doesn't really help my cause. So I've turned toward finding a way to extract a list or corpus (to eventually make a wordcloud) of words that are more frequent in corpus #1 than in corpus #2, presumably with some threshold, e.g. at least 50% more frequent.

This is everything I have right now:

> install.packages("tm")
> install.packages("SnowballC")
> install.packages("wordcloud")
> install.packages("RColorBrewer")
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(RColorBrewer)

> UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
> corpus = Corpus(VectorSource(UKDraft$Report))
> corpus = tm_map(corpus, tolower)
> corpus = tm_map(corpus, PlainTextDocument)
> corpus = tm_map(corpus, removePunctuation)
> corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes",  "kentucky", "wildcats", stopwords("english")))
> frequencies = DocumentTermMatrix(corpus)
> allReports = as.data.frame(as.matrix(frequencies))

> SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))

So if the word "wingspan" has a 100 count frequency in corpus#2 ('SECcorpus') but 150 count frequency in corpus#1 ('corpus'), we would want that word in our resulting corpus/list.
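In base-R terms, with made-up counts, I think the logic I'm after is something like this (the words and numbers here are just for illustration):

```r
# toy sketch of the comparison I want (made-up counts)
ukFreq  <- c(wingspan = 150, shooting = 80, defense = 40)  # corpus #1
secFreq <- c(wingspan = 100, shooting = 90, defense = 10)  # corpus #2
common  <- intersect(names(ukFreq), names(secFreq))
ratio   <- ukFreq[common] / secFreq[common]
names(ratio)[ratio >= 1.5]  # terms at least 50% more frequent in corpus #1
# [1] "wingspan" "defense"
```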

SpicyClubSauce

2 Answers


I can suggest a method that might be more straightforward, based on the new text analysis package I developed with Paul Nulty. It's called quanteda, available on CRAN and GitHub.

I don't have access to your texts, but this will work in a similar fashion for your examples. You create a corpus of your two sets of documents, then add a document variable (using docvars), and then create a document feature matrix, grouping on the new document partition variable. The rest of the operations are straightforward; see the code below. Note that by default, dfm objects are sparse Matrix objects, but subsetting on features is not yet implemented (next release!).

install.packages("quanteda")
library(quanteda)

# built-in character vector of 57 inaugural addresses
str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                    docvars = data.frame(docset = c(rep(1, 29), rep(2, 28))),
                    notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)

# toLower, removePunct are on by default
inaugDfm <- dfm(inaugCorp, 
                groups = "docset", # by docset instead of document
                ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
                matrixType = "dense")

# now compare frequencies and trim based on ratio threshold
ratioThreshold <- 1.5
featureRatio <- inaugDfm[2, ] / inaugDfm[1, ]
# to select where set 2 feature frequency is 1.5x set 1 feature frequency
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]

# plot the wordcloud
plot(inaugDfmReduced)

I would recommend you pass through some options to wordcloud() (what plot.dfm() uses), perhaps to restrict the minimum number of features to be plotted.
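For example, something like the following (the argument names here are the standard wordcloud() options, which plot.dfm() should pass through via ...; treat this as a sketch, since I have not run it against your data):

```r
library(RColorBrewer)

# pass wordcloud() options through plot.dfm()
plot(inaugDfmReduced,
     min.freq = 5,          # drop rarely occurring features
     max.words = 100,       # cap the number of words plotted
     random.order = FALSE,  # plot most frequent words in the centre
     colors = brewer.pal(6, "Dark2"))
```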

Very happy to assist with any queries you might have on using the quanteda package.

Update:

Here's a stab directly at your problem. I don't have your files so cannot verify that it works. Also if your R skills are limited, you might find this challenging to understand; ditto if you have not looked at any of the (sadly limited for now) documentation for quanteda.

I think what you need (based on your comment/query) is the following:

# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate 
docvars(mycorpus1, "docset") <- 1 
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2

then proceed with the dfm step as above, substituting myCombinedCorpus for inaugCorp.
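In other words, something along these lines (same dfm options as in the example above):

```r
# document feature matrix grouped on the docset variable,
# using the combined corpus instead of the inaugural example
myDfm <- dfm(myCombinedCorpus,
             groups = "docset",
             ignoredFeatures = c("strengths", "weaknesses", "notes",
                                 stopwords("english")),
             matrixType = "dense")
```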

Ken Benoit
  • thanks for your response/suggestion. I'm trying to compare two corpora of different sizes and different origins (so, not partitioning one corpus into two smaller corpora as you do in your example). When I try this, though: `Dfm <- dfm(corpus, SECcorpus, ignoredFeatures = c("strengths", "weaknesses", "notes", "outlook", stopwords("english")), matrixType = "dense")` I get an error like this: `Error in UseMethod("dfm") : no applicable method for 'dfm' applied to an object of class "c('VCorpus', 'Corpus')"`. Could you point me in the right direction, Ken? Thanks. – SpicyClubSauce Jun 01 '15 at 15:12
  • What you can do here is use the overloaded "+" operator to combine your two corpora, and then use dfm. You need to add a document variable distinguishing them as in the example for `docset`. The dfm error you are getting comes from you trying to send two corpus arguments to `dfm()`, one of which is not a quanteda corpus. First, make it a quanteda corpus using `corpus()` as in the example. – Ken Benoit Jun 01 '15 at 15:16
  • My confusion with your `corpus()` in the example is that you are partitioning one corpus, `inaugTexts`, into two separate corpora, if I'm following correctly. My issue is that my two corpora are entirely separate text datasets, and I'm having trouble understanding how the `corpus()` example would work when they're not derived from the same parent corpus like `inaugTexts`. – SpicyClubSauce Jun 01 '15 at 15:23
  • basically - how would I combine my two corpora (`corpus` and `SECcorpus`) and then distinguish them? Thanks in advance; I really don't know much R, I'm much more fluent in Python, but I'm trying to become familiar with R as well. – SpicyClubSauce Jun 01 '15 at 15:25
  • thanks Ken. looks clearer to me now. Is there a typo or something wrong with your initial assignment statement, though? I'm getting this error when I try to assign what you wrote to mycorpus1 and mycorpus2: `Error in get_csv(file, textField, ...) : column name report not found.` I'm in the proper working directory and have installed your package and everything. – SpicyClubSauce Jun 01 '15 at 16:06
  • oh, and i put my actual .csv files up here if you're wanting to take an actual look yourself to see if it works: https://github.com/yongcho822/NBADraft-Wordclouds – SpicyClubSauce Jun 01 '15 at 16:07
  • I've tried supplying the exact path of the file as you do here (http://rpackages.ianhowson.com/cran/quanteda/man/textfile.html) but that's working to no avail either. quanteda seems really useful, just wish i could read my files into it. – SpicyClubSauce Jun 01 '15 at 16:24
  • What's the error message you are getting? Probably you are just not finding the file due to path issues. The simplest solution is to make the directory containing your two files your current directory (in RStudio: Session -> Set Working Directory, or the equivalent if you are using the R console). – Ken Benoit Jun 01 '15 at 16:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79343/discussion-between-spicyclubsauce-and-ken-benoit). – SpicyClubSauce Jun 01 '15 at 16:43

I am updating the answer by @Ken Benoit, as it is several years old and the quanteda package has gone through some major changes in syntax.

The current version should be (April 2017):

str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                docvars = data.frame(docset = c(rep(1, 29), rep(2, 29))),
                notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)


inaugDfm <- dfm(inaugCorp, 
            groups = "docset", # by docset instead of document
            remove = c("<p>", "http://", "www", stopwords("english")),
            remove_punct = TRUE,
            remove_numbers = TRUE,
            stem = TRUE)
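
To finish the comparison, the thresholding and plotting steps from the original answer would look roughly like this in the current syntax (textplot_wordcloud() replaces the old plot() method; treat this as a sketch, since I have not run it against the OP's data):

```r
# keep features at least 1.5x more frequent in docset 2 than in docset 1
ratioThreshold <- 1.5
featureRatio <- as.numeric(inaugDfm[2, ]) / as.numeric(inaugDfm[1, ])
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]

# plot the wordcloud (replaces the old plot.dfm method)
textplot_wordcloud(inaugDfmReduced)
```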
RachelSunny