
I'm running the LSI program from Gensim's Topics and Transformations tutorial and, for some reason, the signs of the topic weights keep flipping between positive and negative from one run to the next. For example, this is what I get when I print the results with:

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

Run 1
[(0, 0.066007833960900791), (1, 0.52007033063618491), (2, -0.37649581219168904)]
[(0, 0.196675928591421), (1, 0.7609563167700063), (2, 0.5080674581001664)]
[(0, 0.089926399724459982), (1, 0.72418606267525132), (2, -0.408989731553764)]
[(0, 0.075858476521777865), (1, 0.63205515860034334), (2, -0.53935336057339001)]
[(0, 0.10150299184979866), (1, 0.57373084830029653), (2, 0.67093385852959075)]
[(0, 0.70321089393783254), (1, -0.1611518021402539), (2, -0.18266089635241448)]
[(0, 0.87747876731198449), (1, -0.16758906864658912), (2, -0.10880822642632856)]
[(0, 0.90986246868185872), (1, -0.14086553628718496), (2, 0.00087117874886860625)]
[(0, 0.61658253505692762), (1, 0.053929075663897361), (2, 0.25568697959599318)]

Run 2
[(0, 0.066007833960908563), (1, -0.52007033063618446), (2, -0.37649581219168959)]
[(0, 0.19667592859143226), (1, -0.76095631677000253), (2, 0.50806745810016629)]
[(0, 0.089926399724470751), (1, -0.72418606267525032), (2, -0.40898973155376284)]
[(0, 0.075858476521787177), (1, -0.63205515860034223), (2, -0.5393533605733889)]
[(0, 0.10150299184980684), (1, -0.57373084830029419), (2, 0.67093385852959098)]
[(0, 0.70321089393782976), (1, 0.16115180214026417), (2, -0.18266089635241456)]
[(0, 0.87747876731198149), (1, 0.16758906864660211), (2, -0.10880822642632891)]
[(0, 0.90986246868185627), (1, 0.14086553628719861), (2, 0.00087117874886795399)]
[(0, 0.61658253505692828), (1, -0.053929075663887563), (2, 0.25568697959599251)]

Run 3
[(0, 0.066007833960902929), (1, -0.52007033063618535), (2, 0.37649581219168821)]
[(0, 0.19667592859142491), (1, -0.76095631677000497), (2, -0.50806745810016662)]
[(0, 0.089926399724463771), (1, -0.7241860626752511), (2, 0.40898973155376317)]
[(0, 0.075858476521781085), (1, -0.63205515860034334), (2, 0.5393533605733889)]
[(0, 0.10150299184980124), (1, -0.57373084830029542), (2, -0.67093385852959064)]
[(0, 0.70321089393783143), (1, 0.16115180214025732), (2, 0.18266089635241564)]
[(0, 0.87747876731198304), (1, 0.16758906864659326), (2, 0.10880822642632952)]
[(0, 0.90986246868185761), (1, 0.1408655362871892), (2, -0.00087117874886778746)]
[(0, 0.61658253505692784), (1, -0.053929075663894419), (2, -0.25568697959599318)]

I am running Python 3.5.2 on a PC, coding in IntelliJ.

Has anyone encountered this problem, with the Gensim library or elsewhere?

sophros
mudstick
  • Topic extraction in gensim is a probabilistic process. That is why the results differ from run to run. – DYZ Sep 22 '20 at 23:16
  • The probabilistic nature of topic extraction would explain slight variation in the absolute values of the weights. It would not explain the reversal of the direction of the document-topic relationship. – mudstick Sep 23 '20 at 00:20
  • Which versions of gensim and NumPy are you using? – sophros Sep 23 '20 at 09:04

2 Answers


Under the hood, the LSI model is essentially a fast truncated SVD. SVD computes singular vectors (which are eigenvectors of the matrix multiplied by its own transpose), and these vectors correspond to the topics. However, an eigenvector remains an eigenvector after being multiplied by -1, so the decomposition does not determine the sign. The sign may therefore flip depending on how the algorithm is implemented. In fact, this is the case with the SVD implementation in the popular LAPACK library, and with the NumPy implementation as well.

The sign really does not matter here, because an eigenvector multiplied by -1 is still an eigenvector.
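
To make the point concrete, here is a minimal NumPy sketch. It assumes a small random matrix as a stand-in for the tutorial's term-document matrix, and it uses `np.linalg.svd` directly rather than Gensim's own solver, so the names and the matrix are illustrative only. It flips the sign of one topic (one left singular vector together with its paired right singular vector) and checks that both the reconstruction and the document-to-document cosine similarities are unchanged.

```python
import numpy as np

# Hypothetical 5-term x 4-document matrix; not the tutorial's corpus.
rng = np.random.default_rng(0)
A = rng.random((5, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the second topic: its column in U and its row in Vt.
signs = np.array([1.0, -1.0, 1.0, 1.0])
U_flipped = U * signs              # scales the columns of U
Vt_flipped = Vt * signs[:, None]   # scales the rows of Vt

# The flipped factors reconstruct exactly the same matrix.
reconstruction_unchanged = np.allclose(U_flipped @ np.diag(s) @ Vt_flipped, A)


def cosine_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


# Document coordinates in topic space (one row per document).
docs = (np.diag(s) @ Vt).T
docs_flipped = (np.diag(s) @ Vt_flipped).T

# The coordinates differ in sign, but the similarities do not.
similarities_unchanged = np.allclose(cosine_matrix(docs), cosine_matrix(docs_flipped))

print(reconstruction_unchanged, similarities_unchanged)
```

Since similarity queries depend only on the relative positions of documents in topic space, a flipped topic axis changes the printed weights but not the results of such queries.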

mujjiga
  • My understanding is that these vectors are not the eigenvectors (or rather, singular vectors, since we're doing SVD and not PCA here) themselves, but the "weight" or association of the document with each topic. Doesn't it make a huge difference whether a document has a positive or negative association with the topic? – mudstick Sep 23 '20 at 20:38
  • I got clarification on this in the Gensim Google group. Basically, the term-topic vectors and the document-topic vectors are the left- and right-singular vectors that rotate 180 degrees when the sign flips -- a functionally trivial rotation as you point out, so I'm marking your answer accepted. Thanks. – mudstick Sep 26 '20 at 20:24

There are a number of possibilities:

  1. The order of the topics, or the vocabulary, can differ between runs. If you rebuild everything from scratch each time (including the vocabulary), the resulting topics or the vocabulary may change from run to run, which could explain the differences.
  2. The calculations are numerically unstable. This could happen if a value close to 0.0 gets rounded to either -0.0 or +0.0 (depending on the order of calculation, which can sometimes differ) and influences the sign of the result. This may be due to a generic numerical bug or to a particular combination of software and hardware.
  3. Some other reason not yet identified.
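
One way to tell these possibilities apart is to check whether two runs differ only by a sign flip of whole topics. The sketch below is illustrative: `per_topic_sign_flips` is a hypothetical helper, not part of Gensim, and the inputs are the first three documents of Run 1 and Run 2 from the question, rounded to four decimals and written as plain arrays rather than Gensim's `(topic_id, weight)` tuples.

```python
import numpy as np


def per_topic_sign_flips(run_a, run_b, tol=1e-6):
    """Return a list of +1/-1 per topic if run_b equals run_a up to a
    sign flip of each topic column; return None otherwise.

    Hypothetical helper for illustration; rows are documents and
    columns are topics.
    """
    a = np.asarray(run_a, dtype=float)
    b = np.asarray(run_b, dtype=float)
    if a.shape != b.shape:
        return None
    flips = []
    for k in range(a.shape[1]):
        if np.allclose(a[:, k], b[:, k], atol=tol):
            flips.append(1)       # topic k is identical in both runs
        elif np.allclose(a[:, k], -b[:, k], atol=tol):
            flips.append(-1)      # topic k is the same axis with opposite sign
        else:
            return None           # topic k genuinely differs between runs
    return flips


# First three documents of Run 1 and Run 2 from the question (rounded).
run1 = [[0.0660, 0.5201, -0.3765],
        [0.1967, 0.7610, 0.5081],
        [0.0899, 0.7242, -0.4090]]
run2 = [[0.0660, -0.5201, -0.3765],
        [0.1967, -0.7610, 0.5081],
        [0.0899, -0.7242, -0.4090]]

print(per_topic_sign_flips(run1, run2))  # [1, -1, 1]
```

A result such as `[1, -1, 1]` means that only the second topic changed sign and every weight otherwise matches, which points away from reordered topics or a changed vocabulary. A result of `None` would suggest that the runs differ in some other way.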
sophros