14

I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify. Specifically, I do not understand:

  • random_state
  • update_every
  • chunksize
  • passes
  • alpha
  • per_word_topics

I am working with a corpus of 500 documents that are roughly 3-5 pages each (unfortunately I cannot share a snapshot of the data for confidentiality reasons). Currently I have set:

  • num_topics = 10
  • random_state = 100
  • update_every = 1
  • chunksize = 50
  • passes = 10
  • alpha = 'auto'
  • per_word_topics = True

but this is based solely on an example I saw, and I am not sure how well it generalizes to my data.
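For reference, here is a minimal sketch of how I am passing these settings to LdaModel; the toy documents and the `texts`/`id2word`/`corpus` names below are just placeholders for my real corpus, which I cannot share:

    from gensim.corpora import Dictionary
    from gensim.models.ldamodel import LdaModel

    # Hypothetical toy documents standing in for the real (confidential) corpus.
    texts = [
        ["topic", "model", "inference", "prior"],
        ["document", "word", "topic", "distribution"],
        ["memory", "chunk", "update", "pass"],
    ]
    id2word = Dictionary(texts)
    corpus = [id2word.doc2bow(text) for text in texts]

    lda = LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=10,
        random_state=100,       # seed for reproducible training
        update_every=1,         # update the model after every chunk
        chunksize=50,           # number of documents per chunk
        passes=10,              # full sweeps over the corpus
        alpha='auto',           # learn an asymmetric document-topic prior from the data
        per_word_topics=True,   # keep per-word topic assignments
    )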

1 Answer

21

I wonder if you have seen this page?

Either way, let me explain a few things for you. The number of documents you are using is small for this method (it works much better when trained on a data source the size of Wikipedia), so the results will be rather crude and you have to be aware of that. This is also why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).

As for the other parameters:

  • random_state - this serves as a seed (in case you want to repeat the training process exactly)

  • chunksize - number of documents to consider at once (affects the memory consumption)

  • update_every - update the model every update_every chunks, i.e. after every update_every * chunksize documents have been processed (essentially, this is for memory consumption optimization)

  • passes - how many times the algorithm is supposed to pass over the whole corpus

  • alpha - to cite the documentation:

    can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

  • per_word_topics - setting this to True allows for extraction of the most likely topics given a word. The training process is set up in such a way that every word will be assigned to a topic; otherwise, words that are not indicative are omitted. The phi value is the per-word topic weight that steers this process - it acts as a threshold for whether a word is treated as indicative or not (see the sketch below).
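To make that last point concrete, here is a minimal sketch of querying per-word topics; it assumes the `lda`, `corpus`, and `id2word` objects from the question's setup (those names, and the `minimum_phi_value` threshold shown, are just illustrative):

    # Assumes `lda` is a trained LdaModel, and that `corpus` and `id2word`
    # are the bag-of-words corpus and Dictionary used to build it.
    bow = corpus[0]

    # get_document_topics can also return, for each word id, the topics it was
    # assigned to and the corresponding phi values (per-word topic weights).
    doc_topics, word_topics, phi_values = lda.get_document_topics(
        bow,
        per_word_topics=True,
        minimum_phi_value=0.01,   # word/topic pairs with phi below this are dropped
    )

    for word_id, topic_ids in word_topics:
        print(id2word[word_id], topic_ids)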

Optimal training process parameters are described particularly well in M. Hoffman et al., Online Learning for Latent Dirichlet Allocation.

For memory optimization of the training process or the model see this blog post.

  • Thanks for your response -- very helpful! Just to make sure I understand correctly: chunksize and update_every should not impact LDA's output or its ability to identify topics, but rather affect memory consumption? I just wanted to make sure those are two separate things. – Jane Sully Jun 12 '18 at 13:30
  • 1
    With overly rare updates you risk that the model will not capture the nuances of the data well enough. Numerically the results should be similar (a close approximation) as long as you do not deviate too far from the default values. – sophros Jun 12 '18 at 14:24
  • I did not find much difference between per_word_topics = True and per_word_topics = False. – Nitesh Jindal Jun 30 '19 at 09:18
  • @NiteshJindal - as almost everything in Machine Learning - it all depends on your dataset. – sophros Jun 30 '19 at 09:21
  • 1
    I would like to add, regarding the chunksize parameter (at least for the multicore version), that it might change the result. Have a look at https://groups.google.com/g/gensim/c/FE7_FYSconA and https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html – chAlexey Dec 02 '20 at 17:28
  • What is the difference between `random_state` and `passes`? – Ali A. Jalil Jul 15 '22 at 09:01
  • @AliA.Jalil random_state is the 'seed' of the training, while passes is the number of passes through the whole corpus. – Inaam Ilahi Aug 02 '22 at 07:37