
I want to calculate text similarity using relaxed word mover's distance (RWMD). I have two different datasets (corpora); see below.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = F)

I am using the text2vec package in R. It seems I am doing something wrong.

library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
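For example, prep_fun turns a raw name into a single lower-cased, whitespace-separated string. Note that the hyphen is replaced too, so "x-ray" becomes the two tokens "x" and "ray":

prep_fun("X-ray (left) Hand!")
# [1] "x ray left hand "
# (the trailing space comes from the replaced "!")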

Combine both datasets

C = rbind(A, B)

C$name = prep_fun(C$name)

it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)

Document Term Matrix

dtm = create_dtm(it, vectorizer)

Term Co-occurrence Matrix

tcm = create_tcm(it, vectorizer, skip_grams_window = 3)

GloVe Model

glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)

# get average of main and context vectors as proposed in GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# note the parentheses: nrow(A)+1:nrow(C) would select rows 6:14 and fail
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A) + 1):nrow(C), ], method = rwmd_model, norm = 'none')
head(rwmd_dist)

          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514
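As a sanity check on the embeddings themselves (the corpus here is tiny, so the vectors may be weak), I can also inspect cosine similarities between word vectors with sim2, for example against the vector for "leg":

# cosine similarity of every word vector to the vector for "leg"
word_sim = sim2(wv, wv["leg", , drop = FALSE], method = "cosine", norm = "l2")
head(sort(word_sim[, 1], decreasing = TRUE))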

Desired output: "consultation of gynecologist" in dataframe A should be mapped to "consultation (inspection) of gynecalogist" in dataframe B. Similarly, every text in dataframe A should be matched to its closest text in dataframe B, as sketched below.
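To turn the distance matrix into that mapping, I take the column with the minimum distance in each row, for example:

# index of the closest B entry for each A entry (rwmd_dist is the matrix above)
best_match = apply(rwmd_dist, 1, which.min)
data.frame(A = A$name, B = B$name[best_match], stringsAsFactors = FALSE)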


1 Answer


I am doing something similar, or the same; I will upload my attempt soon. Right now I am trying to optimize the vector size and context window, and to figure out whether a corpus of 5,700 speeches of 1,000-2,000 words each on average [after removing stop words and stemming] is large enough.

I will come back and post the link if it is still needed, but from what I can see, you did not tokenize the corpus: itoken by itself is only an iterator, not a tokenizer. In the examples online, the package author also passes the word_tokenizer function.
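A sketch of what I mean, reusing your prep_fun and combined data frame C:

it = itoken(C$name,
            preprocessor = prep_fun,
            tokenizer = word_tokenizer,
            progressbar = FALSE)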

Lastly, try the pdist2 function, with the texts you want compared placed in corresponding rows of the data frames; it computes parallel (row-by-row) distances. See the sketch below.
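An untested sketch, assuming your dtm and rwmd_model from above and the same text2vec version; pdist2 compares row i of the first matrix with row i of the second, so both slices must have the same number of rows:

# compare the first 4 texts of A with the 4 texts of B, pairwise by row
rwmd_pdist = pdist2(dtm[1:4, ], dtm[6:9, ], method = rwmd_model, norm = 'none')
rwmd_pdist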