I'm trying to plot bigrams from a sample of free-text comments about meetings held during the last month. I'm using the following method (NGramTokenizer from the RWeka package):
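library(tm)     # TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control
# min = max = 2 yields bigrams only (max = 3 would also emit trigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))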
dtm <- TermDocumentMatrix(modif.corpus.irri.aff(MyComments),
                          control = list(tokenize = BigramTokenizer))
where modif.corpus.irri.aff() is my own function that converts the comments to a corpus (it also stems the text with stemDocument, by the way).
To display the bar plot, the end of the code is this:
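dm <- as.matrix(t(dtm))            # documents in rows, terms in columns
v <- colSums(dm)                   # total occurrences of each term
v <- sort(v, decreasing = TRUE)
v_top <- sort(v[1:nb.terms])       # nb.terms most frequent terms (nb.terms is set elsewhere)
barplot(v_top, horiz = TRUE, cex.names = 0.5,
        las = 1, col = grey.colors(10), main = "title",
        names.arg = names(v_top))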
This works quite well, but I want to display "pair occurrences" rather than "bigram occurrences": that is, count each pair of words regardless of word order, because I care more about the ideas expressed than about the exact bigrams.
An example, to be clear:
I want to merge the bar for "long meeting_" with the one for "meeting_ long", because they express the same idea: meetings were too long.
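To make that concrete, here is a rough sketch of the merge I have in mind, on made-up counts rather than my real data. The count vector and the pair_key helper are hypothetical, just for illustration; the idea is to give both word orders the same key by sorting the words inside each bigram, then sum the counts per key:

```r
# Hypothetical bigram counts, standing in for the vector `v` computed above
v <- c("long meeting_" = 5, "meeting_ long" = 3, "good agenda" = 2)

# Order-independent key: split the bigram on whitespace, sort the words, re-join
# (pair_key is an illustrative helper, not part of RWeka or tm)
pair_key <- function(term) {
  paste(sort(strsplit(term, "\\s+")[[1]]), collapse = " ")
}
keys <- vapply(names(v), pair_key, character(1), USE.NAMES = FALSE)

# Sum the counts of all bigrams that share the same key
pair_counts <- sort(tapply(v, keys, sum), decreasing = TRUE)
print(pair_counts)
```

The drawback is that the labels come out in alphabetical word order rather than the order people actually wrote, so I would prefer to handle this at the tokenizer level if that is possible.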
Is there a control parameter in NGramTokenizer that handles this, or do I need to add a step of my own?