I am using Relaxed Word Mover's Distance (RWMD) from the text2vec package to compute distances between documents, so that I can identify the most similar document for each target document. The word vectors were trained with FastText, as available in the gensim package in Python. Document length varies from one word to over 50 words.

Some documents are duplicated in the corpus. I would expect the distance between such duplicates to be very close to 0, and to be the same for every pair of identical documents. However, what I observe is that the distance between identical documents varies from close to 0 to over 1, and some less relevant documents are even ranked as closer than the identical pair. The code I use is as follows:
library(text2vec)
tokens = word_tokenizer(tolower(data$item))
v = create_vocabulary(itoken(tokens))
v = prune_vocabulary(v, term_count_min = 12, term_count_max = 1000000)
it = itoken(tokens)
# Use our filtered vocabulary
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 50)
# Word vectors from FastText
wv_fasttext <- as.data.frame(wv_fasttext)
rownames(wv_fasttext) <- wv_fasttext[, 'na']
wv_fasttext$name <- NULL
wv_fasttext <- data.matrix(wv_fasttext, rownames.force = TRUE)
rwmd_model = RWMD$new(wv)
rwmd_distance = dist2(dtm[1:1000, ], dtm[1:1000, ], method = rwmd_model, norm = 'none')
Is there any problem with the above model?
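
To make the expectation about duplicated documents concrete, here is a minimal base-R sketch of RWMD on toy data. It is not text2vec's implementation: the embeddings in `wv_toy`, the documents, and the helper names `rwmd_one_way` and `rwmd_toy` are all made up for illustration, and cosine distance between word vectors is an assumption. In this sketch every word of a document can move to the same word in its duplicate at zero cost, so identical documents should come out at (numerically) zero.

```r
# Toy embeddings for a hypothetical 4-word vocabulary (values are made up).
wv_toy <- rbind(
  apple = c(1.0, 0.1),
  pear  = c(0.9, 0.2),
  car   = c(0.1, 1.0),
  truck = c(0.2, 0.9)
)

# Cosine distance between every pair of words (assumed word-level cost).
unit <- wv_toy / sqrt(rowSums(wv_toy^2))  # L2-normalise each row
word_dist <- 1 - unit %*% t(unit)
word_dist[word_dist < 0] <- 0             # clamp tiny negative rounding error

# One-sided relaxed cost: each word of `a` moves entirely to its
# nearest word in `b`, weighted by its relative frequency in `a`.
rwmd_one_way <- function(a, b, word_dist) {
  wa <- table(a) / length(a)
  nearest <- apply(word_dist[names(wa), unique(b), drop = FALSE], 1, min)
  sum(as.numeric(wa) * nearest)
}

# RWMD takes the larger of the two one-sided costs.
rwmd_toy <- function(a, b, word_dist) {
  max(rwmd_one_way(a, b, word_dist), rwmd_one_way(b, a, word_dist))
}

doc1 <- c("apple", "pear")
doc2 <- c("apple", "pear")   # duplicate of doc1
doc3 <- c("car", "truck")    # unrelated document

rwmd_toy(doc1, doc2, word_dist)  # identical documents
rwmd_toy(doc1, doc3, word_dist)  # unrelated documents
```

Running the same kind of check on a handful of known duplicate rows of the real `dtm` would show whether the distances from `dist2` deviate from this expectation.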