
I am using Relaxed Word Mover's Distance (RWMD) from the text2vec package to compute the distance between documents, so as to identify the most similar document for each target document. The word vectors were trained with FastText, as available in the gensim package in Python. Document length varies from one word to over 50 words. Some documents are duplicated in the corpus. I assume that the distance between such duplicates should be very small, and that the value should be the same for every pair of identical documents. However, what I observe is that the distance between identical documents varies from close to 0 to over 1, and some less relevant documents even come out closer than the identical ones. The code I use is as follows:

library(text2vec)
tokens = word_tokenizer(tolower(data$item))

v = create_vocabulary(itoken(tokens))

v = prune_vocabulary(v, term_count_min = 12, term_count_max = 1000000)
it = itoken(tokens)

# Use our filtered vocabulary
vectorizer = vocab_vectorizer(v)

dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 50)

# Word vectors from FastText; the word itself is stored in the 'name' column
wv_fasttext <- as.data.frame(wv_fasttext)
rownames(wv_fasttext) <- wv_fasttext[, 'name']

wv_fasttext$name <- NULL
wv_fasttext <- data.matrix(wv_fasttext, rownames.force = TRUE)

rwmd_model = RWMD$new(wv_fasttext)

rwmd_distance = dist2(dtm[1:1000,], dtm[1:1000,], method = rwmd_model, norm = 'none')

Is there any problem with the above model?
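
To make the expectation concrete, here is a minimal base-R sketch of relaxed WMD on toy data. It does not use text2vec, and the 2-D word vectors, the helper names `cosine_dist` and `rwmd_pair`, and the example documents are all made up for illustration. Under these assumptions, two documents with the same bag of words should get a distance of 0 (up to floating-point error), and an unrelated document should be further away than a partially overlapping one.

```r
# Toy word vectors (hypothetical values, not FastText output).
wv_toy <- rbind(
  sports    = c(1.0, 0.0),
  golf      = c(0.9, 0.1),
  newspaper = c(0.0, 1.0)
)

# Cosine distance between every pair of rows of m.
cosine_dist <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  1 - tcrossprod(unit)
}

# Relaxed WMD between two named word-count vectors x and y.
# Each word moves all of its weight to the nearest word of the other
# document; the larger of the two one-sided costs is returned.
rwmd_pair <- function(x, y, cost) {
  wx <- x[x > 0] / sum(x)
  wy <- y[y > 0] / sum(y)
  sub <- cost[names(wx), names(wy), drop = FALSE]
  d_xy <- sum(wx * apply(sub, 1, min))
  d_yx <- sum(wy * apply(sub, 2, min))
  max(d_xy, d_yx)
}

cost <- cosine_dist(wv_toy)

doc_a <- c(sports = 1, golf = 0, newspaper = 0)  # "sports"
doc_b <- c(sports = 1, golf = 0, newspaper = 0)  # duplicate of doc_a
doc_c <- c(sports = 1, golf = 1, newspaper = 0)  # "sports golf"
doc_d <- c(sports = 0, golf = 0, newspaper = 1)  # "newspaper"

d_ab <- rwmd_pair(doc_a, doc_b, cost)  # identical documents
d_ac <- rwmd_pair(doc_a, doc_c, cost)  # partial overlap
d_ad <- rwmd_pair(doc_a, doc_d, cost)  # unrelated document
print(c(identical = d_ab, overlap = d_ac, unrelated = d_ad))
```

Because the result depends only on the word counts and the word vectors, duplicated documents with identical rows in the document-term matrix cannot differ from each other in this sketch. Comparing the rows of `dtm` for the duplicated documents is therefore one way to narrow down where the variation comes from.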

massisenergy
TMC
  • Please provide a reproducible example. Otherwise it is hard to help. – Dmitriy Selivanov Dec 07 '18 at 19:00
  • E.g. one document contains only one word, "Sports". I compared this document with 10 other documents containing exactly the same content, i.e. "Sports". The distances are 0.00, 0.41, 0.67, 1.0... up to 1.33. By contrast, the distance between this document and a document containing the words "Sports other than golf" is only 0.52, and the distance to an irrelevant document, "newspaper", is even lower at 0.32. – TMC Dec 14 '18 at 09:46

0 Answers