Identifying and grouping synonyms in R

Question

I'm trying to identify and aggregate synonyms for a given data set. Please see sample data below.

library(tm)
library(SnowballC)

dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")

docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)

Result:

accept    dad  happy  admit  large    big  daddy   glad   papa 
    15     14      9      8      8      6      4      4      2

I'd like to find synonyms for each of the above words using the wordnet package that I downloaded and installed. For example to get the synonym of "accept" I can do:

library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")

filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])

Result:

 [1] "accept"    "admit"     "assume"    "bear"      "consent"   "go for"    "have"      "live with"
 [9] "swallow"   "take"      "take on"   "take over"

Now, I'd like to combine these two results sets so that it groups synonyms in the following way. Mark the most common words (rank 1) for a given group and group by these words later on similar to this:

id  word    word_count  syn_group   rank
1   accept  15          1           1
5   admit   8           1           2
2   dad     14          2           1
8   daddy   4           2           2
9   papa    2           2           3
3   happy   9           3           1
7   glad    4           3           2
4   large   8           4           1
6   big     6           4           2

this then could be aggregated like this

id  word    word_count
1   accept  15+8
2   dad     14+4+2
3   happy   9+4
4   large   8+6

and the final result would be then

id  word    word_count
1   accept  23
2   dad     20
3   large   14
4   happy   13

I have faced several issues including getting GetIndexTerms to loop through through the words whether they are nouns ,verbs, etc. Hope this all makes sense? Any help would be much appreciated. Thank you.

FYI: go to Wordnet.princeton.edu and download the version of WordNet for your operating system. After you install it, you can pick up from GraveDigger's code after library(wordnet). — lawyeR, May 30 '17 at 00:40

shayaa · Answer 1 · 2017-02-24T00:25:32.467

We can do the following using dplyr

library(dplyr)
df %>% 
  group_by(syn_group) %>%
  mutate(sum_word_count = sum(word_count)) %>% 
  filter(rank == 1)

Data:

df <- read.table(text = "id  word    word_count  syn_group   rank
1   accept  15          1           1
5   admit   8           1           2
2   dad     14          2           1
8   daddy   4           2           2
9   papa    2           2           3
3   happy   9           3           1
7   glad    4           3           2
4   large   8           4           1
6   big     6           4           2", header = T)

Please next time post the output of dput.

Edit: Here is some code to help you get started looping over the words and parts of speech, and storing the synonyms. What is left is to determine if a current term is a synonym of a previous one, in which case you already have the synonyms, and you can assign a unique syngroup. Next, you need to store some results. Finally, you need to calculate the rank, which is just seq_along the synonyms and a grep to determine the rank position. The comments are hints at where you might want to include code for these tips.

d <- data.frame(Term = row.names(m), word_count = m[,1])
all_pos <- c("ADJECTIVE", "ADVERB", "NOUN","VERB")
syns <- vector("list", length(all_pos))
for(w in seq(nrow(d))){
  # if sysns of (d$Term[w]) has been calculated skip over current w 
  emf <- getTermFilter("ExactMatchFilter", d$Term[w], TRUE)  
  for(i in seq_along(syns)){
    terms <- getIndexTerms(all_pos[i], 1, emf)
    if(is.null(terms)){
      syns[i] <- NA
    } else{
      syns[[i]] <-  getSynonyms(terms[[1]])
    }
  }
  # store the results of syns for current w 
}

Thanks for your response. The problem is that I don't have that table (see your comment), that's what I'm trying to come up with. All I got is the character vector with a bunch of words and access to wordnet. Any other suggestion? Thanks again! — GraveDigger, Feb 22 '17 at 23:34
@GraveDigger, edited to include enough information to help you to solve this problem. — shayaa, Feb 23 '17 at 20:04
Thank you!! I'm rather new to R so don't understand everything you did, will need to spend some time figuring out what you did and why. — GraveDigger, Feb 28 '17 at 22:51

Identifying and grouping synonyms in R

1 Answers1