I created a term-document matrix, "myDtm", for a set of keywords contained in a large collection of patents. I want to obtain an ordered list (a kind of top 100) of the patents with the highest keyword frequency.
The code is:
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
keywords <- unique(c("labor","cost","autom", "human" ,"person", "intens","reduc","machin","algorithm"))
inspect(myDtm[keywords,tail(order(colSums(v)),100)])
The result looks like this (excerpt):
Terms     2435 33164 27276 1874 20447 41149 35987 21765 798 2461 19249 6822 27640
labor        0     0     0    0     1     0     0     0   0    0     0    0     2
cost        11     0     0    0    13     0     0     0   2    9     0    0     9
autom        0     0     0    0    26     0     0     0   0    0     0    0     0
human        0     0     0  270   150    16     0   279   0    0    54    0     1
person       0    29     0    0    46     3     0     0   0    0     0    0     1
intens       0     0     0    1     0     0     0     0   0    0     0    0    41
reduc        8     0     8    9    13   289     2    12  12  305   292    0    44
machin     264    77     0    0     2     0     0     2   0    0     0  323    31
algorithm    0     0     8    0     0     0     1     0   2    0     0    0    95
The question: how can I exclude outliers such as patent no. 6822? By outliers I mean patents that contain only one or two of the keywords, but at a very high frequency. I would like to obtain a top 100 list of patents that look like patent no. 20447 or 27640, where most of the keywords occur. More specifically, is there a way of saying: order the columns by the total frequency of keyword mentions AND make sure that at least 50% of the keywords are mentioned?
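To make the criterion concrete, here is a minimal sketch on a toy matrix built from four columns of the excerpt above. It is only an illustration under assumptions: the names m, coverage, keep, filtered and top are made up, and for the real data it assumes the keyword rows fit in memory as a plain matrix, e.g. via as.matrix(myDtm[keywords, ]).

```r
# Toy matrix: rows = keywords, columns = patents (values copied from the excerpt).
# For the real data one could presumably start from:
#   m <- as.matrix(myDtm[keywords, ])   # assumption: small enough to densify
m <- matrix(
  c(0, 11,  0,   0,  0,  0,  8, 264,  0,   # patent 2435
    1, 13, 26, 150, 46,  0, 13,   2,  0,   # patent 20447
    0,  0,  0,   0,  0,  0,  0, 323,  0,   # patent 6822
    2,  9,  0,   1,  1, 41, 44,  31, 95),  # patent 27640
  nrow = 9,
  dimnames = list(
    c("labor", "cost", "autom", "human", "person",
      "intens", "reduc", "machin", "algorithm"),
    c("2435", "20447", "6822", "27640")
  )
)

# Share of keywords that occur at least once in each patent.
coverage <- colMeans(m > 0)

# Keep only patents in which at least 50% of the keywords are mentioned.
keep <- coverage >= 0.5
filtered <- m[, keep, drop = FALSE]

# Order the remaining patents by total keyword frequency, highest first,
# and keep at most 100 of them.
ord <- head(order(colSums(filtered), decreasing = TRUE), 100)
top <- filtered[, ord, drop = FALSE]

print(round(coverage, 2))
print(top)
```

With this filter, patents 6822 and 2435 would be dropped because they mention only one and three of the nine keywords, while 20447 and 27640 remain. The 0.5 threshold is the "at least 50%" from the question and can be changed freely.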
Thank you in advance.