I'm trying to identify and aggregate synonyms for a given data set. Please see sample data below.
library(tm)
library(SnowballC)
dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")
docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m), decreasing = TRUE)
Result:
accept    dad  happy  admit  large    big  daddy   glad   papa
    15     14      9      8      8      6      4      4      2
I'd like to find synonyms for each of the above words using the wordnet package, which I have downloaded and installed. For example, to get the synonyms of "accept", I can do:
library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")
filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])
Result:
[1] "accept" "admit" "assume" "bear" "consent" "go for" "have" "live with"
[9] "swallow" "take" "take on" "take over"
Now I'd like to combine these two result sets so that synonyms are grouped together. Within each group, the most frequent word should get rank 1 and act as the label for the group, so that I can group by it later, similar to this:
id  word    word_count  syn_group  rank
1   accept  15          1          1
5   admit   8           1          2
2   dad     14          2          1
8   daddy   4           2          2
9   papa    2           2          3
3   happy   9           3          1
7   glad    4           3          2
4   large   8           4          1
6   big     6           4          2
This could then be aggregated like this:
id word word_count
1 accept 15+8
2 dad 14+4+2
3 happy 9+4
4 large 8+6
and the final result would then be:
id word word_count
1 accept 23
2 dad 20
3 large 14
4 happy 13
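To make the grouping I'm after concrete, here is a minimal base R sketch of it. It assumes the synonyms have already been looked up: `syn_lookup` is a hypothetical, hand-written stand-in for the WordNet results (not real wordnet output), and `group_synonyms` is just an illustrative name. Words are visited from most to least frequent, and each word that is not yet in a group becomes the rank 1 word of a new group and pulls in its synonyms.

```r
# Word counts as returned by sort(rowSums(m), decreasing = TRUE) above.
counts <- c(accept = 15, dad = 14, happy = 9, admit = 8, large = 8,
            big = 6, daddy = 4, glad = 4, papa = 2)

# Hypothetical stand-in for the WordNet lookups: word -> its synonyms.
syn_lookup <- list(
  accept = c("admit", "assume", "take"),
  dad    = c("daddy", "papa", "pa"),
  happy  = c("glad", "felicitous"),
  large  = c("big")
)

group_synonyms <- function(counts, syn_lookup) {
  counts <- sort(counts, decreasing = TRUE)
  words <- names(counts)
  # head_of[w] will hold the rank 1 word of the group that w belongs to.
  head_of <- setNames(rep(NA_character_, length(words)), words)
  for (w in words) {
    if (!is.na(head_of[[w]])) next               # already in a group
    # Keep only synonyms that actually occur in the data set.
    members <- intersect(words, c(w, syn_lookup[[w]]))
    members <- members[is.na(head_of[members])]  # do not steal grouped words
    head_of[members] <- w
  }
  # Sum the counts within each group and order by the group total.
  totals <- vapply(split(counts, unname(head_of[words])), sum, numeric(1))
  totals <- sort(totals, decreasing = TRUE)
  data.frame(id = seq_along(totals), word = names(totals),
             word_count = unname(totals), row.names = NULL)
}

result <- group_synonyms(counts, syn_lookup)
print(result)
```

With the hand-written lookup above this gives accept 23, dad 20, large 14 and happy 13, which is the final table I want. What is missing is a way to build the lookup from WordNet itself.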
I have run into several issues, including getting getIndexTerms to loop through the words regardless of whether they are nouns, verbs, etc. I hope this all makes sense. Any help would be much appreciated. Thank you.
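For the part-of-speech problem, this is roughly the loop I have in mind, as a sketch rather than working code I can vouch for. It only uses the wordnet calls shown above (`getTermFilter`, `getIndexTerms`, `getSynonyms`) and assumes `setDict` has already been called. `wordnet_lookup`, `all_synonyms` and `pos_tags` are names I made up, and I am assuming that a lookup with no match either returns nothing or throws an error, so both cases are handled.

```r
# Parts of speech to try for every word.
pos_tags <- c("NOUN", "VERB", "ADJECTIVE", "ADVERB")

# Synonyms of one word under one part of speech, via the wordnet package.
wordnet_lookup <- function(word, pos) {
  filter <- wordnet::getTermFilter("ExactMatchFilter", word, TRUE)
  terms <- wordnet::getIndexTerms(pos, 1, filter)
  if (length(terms) == 0) return(character(0))  # no entry for this POS
  wordnet::getSynonyms(terms[[1]])
}

# Try every part of speech and pool the results, excluding the word itself.
# `lookup` can be swapped out, e.g. for a stub when WordNet is not installed.
all_synonyms <- function(word, lookup = wordnet_lookup) {
  found <- lapply(pos_tags, function(pos) {
    tryCatch(lookup(word, pos), error = function(e) character(0))
  })
  setdiff(unique(tolower(unlist(found))), tolower(word))
}
```

Calling `all_synonyms("accept")` for each word in the counts would then give the synonym lists needed for the grouping, but I am not sure this is the right way to handle words that exist under several parts of speech.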