
I created a term-document matrix, "myDtm", of a set of keywords contained in a large collection of patents. I want to obtain an ordered list (a top 100 of sorts) of the patents with the highest keyword frequencies.

The relevant code is:

library(tm)
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
keywords <- unique(c("labor", "cost", "autom", "human", "person", "intens", "reduc", "machin", "algorithm"))
v <- as.matrix(myDtm[keywords, ])  # keyword counts per patent
inspect(myDtm[keywords, tail(order(colSums(v)), 100)])  # the 100 patents with the most keyword mentions

The result looks like this (excerpt):

Terms       2435 33164 27276 1874 20447 41149 35987 21765 798 2461 19249 6822 27640
  labor        0     0     0    0     1     0     0     0   0    0     0    0     2
  cost        11     0     0    0    13     0     0     0   2    9     0    0     9
  autom        0     0     0    0    26     0     0     0   0    0     0    0     0
  human        0     0     0  270   150    16     0   279   0    0    54    0     1
  person       0    29     0    0    46     3     0     0   0    0     0    0     1
  intens       0     0     0    1     0     0     0     0   0    0     0    0    41
  reduc        8     0     8    9    13   289     2    12  12  305   292    0    44
  machin     264    77     0    0     2     0     0     2   0    0     0  323    31
  algorithm    0     0     8    0     0     0     1     0   2    0     0    0    95

The question: how can I exclude outliers such as patent no. 6822? By outliers I mean patents that contain only one or two of the keywords, but with a very high frequency. I would like a top 100 list of patents that look like patent no. 20447 or 27640, where most of the keywords appear. More specifically, is there a way of saying: order the columns by the frequency of keyword mentions AND make sure at least 50% of the keywords are mentioned?

Thank you in advance.

  • What do you mean by an outlier? How do you define it? "where most keywords are contained" is too vague to formalize; you need to be much more specific, e.g. should 60% of the keywords be contained, or how many? What you could do is count the number of terms per document, plot that with a boxplot and decide on the cutoff that way, something like `newDtm <- sapply(myDtm, function(x) ifelse(x > 0, 1, 0)); boxplot(colSums(newDtm))` – grrgrrbla May 22 '15 at 10:08
  • A patent in which only one or two keywords are represented but used very often, and which therefore lands in the top 100 list (e.g. patent 41149 or 2461). I am only interested in patents with multiple keyword mentions (e.g. patent 20447). – Giuliano Joshua May 22 '15 at 10:11

1 Answer


The following excludes all patents in which two or fewer keywords are present and gives you a data frame containing only the patents with more than two keywords present (the code below treats myDtm as a data frame with the terms in the first column and one column per patent):

# keep the term column plus every patent column in which more than two keywords appear
myDtm[, c(TRUE, sapply(myDtm[-1], function(x) sum(ifelse(x > 0, 1, 0)) > 2))]

If you want to do this just for the top 100, combine the code above with the top-100 selection you already have in the question, as sketched below.
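
A minimal sketch of that combination (my own illustration, assuming myDtm is a data frame with the terms in the first column and one column per patent, as above):

# keep only the patents in which more than two keywords appear
kept <- myDtm[, c(TRUE, sapply(myDtm[-1], function(x) sum(x > 0) > 2))]
# among those, order the patents by total keyword mentions and keep the (up to) 100 largest
totals <- sapply(kept[-1], sum)
top100 <- kept[, c(names(kept)[1], names(tail(sort(totals), 100)))]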

If you want at least 50% of the keywords mentioned, then you would have to do the following:

myDtm[, c(TRUE, sapply(myDtm[-1], function(x) sum(ifelse(x > 0, 1, 0)) / length(x) >= 0.5))]

or equivalently:

myDtm[, c(TRUE, sapply(myDtm[-1], function(x) mean(ifelse(x > 0, 1, 0)) >= 0.5))]

or in functional notation:

cbind(myDtm[1], Filter(function(x) mean(ifelse(x > 0, 1, 0)) >= 0.5, myDtm[-1]))

If you want to examine the frequency counts, make a new data frame and generate some boxplots, summary stats, etc. (1.5 * IQR, i.e. 1.5 times the interquartile range, is often used as a cutoff for outliers):

# proportion of the keywords present in each patent
table_Frequency_counts <- sapply(myDtm[-1], function(x) mean(ifelse(x > 0, 1, 0)))
boxplot(table_Frequency_counts)
summary(table_Frequency_counts)
1.5 * IQR(table_Frequency_counts)
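
To actually apply that cutoff, a rough sketch (my own illustration, reusing the quantities computed above) would keep only the patents whose keyword coverage lies within the usual boxplot whiskers:

# drop patents whose keyword coverage falls outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR
q <- quantile(table_Frequency_counts, c(0.25, 0.75))
fence <- 1.5 * IQR(table_Frequency_counts)
keep <- table_Frequency_counts >= q[1] - fence & table_Frequency_counts <= q[2] + fence
cbind(myDtm[1], myDtm[-1][, keep, drop = FALSE])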
  • Hey, thanks so much for your help, this already got me a lot further. I just came across the package "textir", which contains a function called "tf-idf" that ranks the documents according to frequency counts rather than absolute counts (see p. 9 here: http://cran.r-project.org/web/packages/textir/textir.pdf) - do you think that would help? – Giuliano Joshua May 22 '15 at 10:42
  • Like I said, I would look at the distribution of the frequency counts using summary statistics and plots (for example boxplots) and then decide on a cutoff. I have no idea what you want to show or do, and what is appropriate depends on so many things that any answer on my side would just be guesswork. I edited my post so that you have the code to do so. – grrgrrbla May 22 '15 at 10:49
  • You are welcome; please accept the answer by clicking the tick if it answers your question, and click the up-arrow. – grrgrrbla May 22 '15 at 11:00