
I was using latent semantic analysis (LSA) in the text2vec package to generate word vectors, and transform() to fit new data, when I noticed something odd: the vector spaces are not lined up even when the models are trained on the same data.

There appears to be some inconsistency (or randomness?) in the method: even when re-running an LSA model on the exact same data, the resulting word vectors are wildly different, despite identical input. Looking around, I only found these old, closed GitHub issues link link and a mention in the changelog about LSA being cleaned up. I reproduced the behaviour using the movie_review dataset and (slightly modified) code from the documentation:

library(text2vec)
library(magrittr)  # for %>%
packageVersion("text2vec") # ‘0.5.1’
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# edit: make tcm symmetric:
tcm = tcm + Matrix::t(Matrix::triu(tcm)) 
n_topics = 10
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)

# despite being trained on the same data, words have completely different vectors:
sim2(d1["film",,drop=F], d2["film",,drop=F])
# yields values like -0.993363 but sometimes 0.9888435 (should be 1)

mean(diag(sim2(d1, d2))) 
# e.g. -0.2316826
hist(diag(sim2(d1, d2)), main="self-similarity between models")
# note: these numbers are different every time!

# But: within each model, results seem consistent and reasonable:
# top similar words for "film":
head(sort(sim2(d1, d1["film",,drop=F])[,1],decreasing = T))
#    film     movie      show     piece territory       bay 
# 1.0000000 0.9873934 0.9803280 0.9732380 0.9680488 0.9668800 

# same in the second model:
head(sort(sim2(d2, d2["film",,drop=F])[,1],decreasing = T))
#      film     movie      show     piece territory       bay 
#  1.0000000 0.9873935 0.9803279 0.9732364 0.9680495 0.9668819

# transform works:
sim2(d2["film",,drop=F], transform(tcm["film",,drop=F], lsa_2 )) # yields 1

# LSA in quanteda doesn't have this problem, same data => same vectors
library(quanteda)
d1q = textmodel_lsa(as.dfm(tcm), 10)
d2q = textmodel_lsa(as.dfm(tcm), 10)
mean(diag(sim2(d1q$docs, d2q$docs)))  # yields 1
# the top synonyms for "film" are also a bit different with quanteda's LSA
#   film     movie      hunk      show territory       bay 
# 1.0000000 0.9770574 0.9675766 0.9642915 0.9577723 0.9573138

What's the deal: is it a bug, is it intended behaviour for some reason, or am I having a massive misunderstanding? (I'm kind of hoping for the latter...) If it's intended, why does quanteda behave differently?

user3554004
  • I think this may be because of this bug - https://github.com/dselivanov/text2vec/issues/247. Can you check the latest version from GitHub? – Dmitriy Selivanov Feb 14 '19 at 06:22
  • @DmitriySelivanov I tried 0.5.1.5 from GitHub, no difference. Also, this bug (I thought it was resolved anyway?) would be an unlikely cause, being just a different formula - the key here is that the results vary between runs of LSA. At the same time, internally within a model, the vectors seem fine - see the example above; I also get a 0.39 Spearman correlation on SimLex-999 from an LSA trained on a relatively small English corpus, which would suggest the model works; it's just that the vectors are valued differently if you run the same model twice (I assumed there's no stochasticity in LSA/SVD). – user3554004 Feb 14 '19 at 16:24

2 Answers

1

The issue is that your matrix appears to be ill-conditioned, and hence you run into numerical stability issues.

library(text2vec)
library(magrittr)
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)

# condition number
kappa(tcm)
# Inf

Now if you do a truncated SVD (the algorithm behind LSA), you will notice that the entries of some singular vectors are very close to zero:

library(irlba)
truncated_svd = irlba(tcm, 10)
str(truncated_svd)
# $ d    : num [1:10] 2139 1444 660 559 425 ...
# $ u    : num [1:4387, 1:10] -1.44e-04 -1.62e-04 -7.77e-05 -8.44e-04 -8.99e-04 ...
# $ v    : num [1:4387, 1:10] 6.98e-20 2.37e-20 4.09e-20 -4.73e-20 6.62e-20 ...
# $ iter : num 3
# $ mprod: num 50

Hence the sign of the embeddings is not stable, and the cosine angle between them is not stable either.
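To see why a sign ambiguity alone is enough to break cross-model cosine similarities, here is a small self-contained sketch (a toy random matrix with base R's svd(), not the tcm from the question): flipping the signs of a matching pair of singular vectors gives an equally valid factorization, but the embeddings built from it no longer have cosine similarity 1 with the originals.

library(text2vec)  # for sim2()

set.seed(1)
m = matrix(rnorm(200), nrow = 20, ncol = 10)
s = svd(m, nu = 3, nv = 3)

emb1 = s$u %*% diag(s$d[1:3])
# flipping the sign of the 2nd left singular vector (together with the matching
# right singular vector) is still a valid SVD of m
flip = diag(c(1, -1, 1))
emb2 = (s$u %*% flip) %*% diag(s$d[1:3])

sim2(emb1[1, , drop = FALSE], emb2[1, , drop = FALSE])  # not 1, although both come from the same m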

Dmitriy Selivanov
  • Thanks, that makes sense; but just to be clear, are you saying this is a problem with the input data, not the method? Or are you saying it's not really an issue at all, i.e., I shouldn't be worried about the quality of the vectors within a single LSA? I also see this with larger corpora, though, so it's not the "smallness" of 1k movie reviews. Also, I don't know if it makes a big difference, but I forgot to make the tcm symmetric in the example: tcm = tcm + Matrix::t(Matrix::triu(tcm)) - now kappa() yields a finite but still very large value. – user3554004 Feb 19 '19 at 13:40
  • I would say this is a problem with the data, and it is not about the size but rather the pattern. I've checked `quanteda` - it uses `svds` from the `RSpectra` package. Maybe for this particular edge case it works better / is more stable. – Dmitriy Selivanov Feb 19 '19 at 17:49
  • Good point, so it's the underlying method. Although the `irlba` author seems to claim it actually performs better than `RSpectra` on difficult problems: https://bwlewis.github.io/irlba/comparison.html And I'm not sure what would make the movie_review corpus or my own historical corpus edge cases; I thought this sort of data would be exactly where LSA/SVD works well. – user3554004 Feb 19 '19 at 19:34
  • An extra comment in case somebody encounters the same issue and finds this: I just noticed the vectors also seem to differ slightly between text2vec's LSA and quanteda's LSA in terms of the top synonyms for target words; edited the Q. – user3554004 Feb 19 '19 at 19:46
  • Yep, this seems to be a consequence of the poor numerical stability of the SVD for this problem. – Dmitriy Selivanov Feb 20 '19 at 05:47
0

Similar to how it works in sklearn in Python, the truncated SVD functions in R use random initialization internally. That is part of what makes them so powerful for building large models, but it can be somewhat inconvenient for smaller uses. If you call set.seed() before the SVD is computed, you shouldn't have this issue. This used to terrify me when doing LSA.
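For example, a minimal sketch of that suggestion, reusing the tcm and n_topics objects from the question (whether this fully removes the run-to-run differences will depend on the SVD backend):

# re-seed the RNG before each fit so the randomized SVD starts from the same
# initial vectors (sketch; assumes tcm and n_topics as defined in the question)
set.seed(42)
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)

set.seed(42)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)

mean(diag(sim2(d1, d2)))  # should now be (close to) 1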

Let me know if that helps!