14

I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify. Specifically, I do not understand:

  • random_state
  • update_every
  • chunksize
  • passes
  • alpha
  • per_word_topics

I am working with a corpus of 500 documents that are roughly 3-5 pages each (unfortunately I cannot share a snapshot of the data for confidentiality reasons). Currently I have set:

  • num_topics = 10
  • random_state = 100
  • update_every = 1
  • chunksize = 50
  • passes = 10
  • alpha = 'auto'
  • per_word_topics = True

but this is based solely on an example I saw, and I am not sure how well it generalizes to my data.
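For reference, here is a minimal sketch of how I am passing these settings to LdaModel; the toy documents and the `texts`/`id2word`/`corpus` names below are just placeholders for my real corpus, which I cannot share:

    from gensim.corpora import Dictionary
    from gensim.models.ldamodel import LdaModel

    # Hypothetical toy documents standing in for the real (confidential) corpus.
    texts = [
        ["topic", "model", "inference", "prior"],
        ["document", "word", "topic", "distribution"],
        ["memory", "chunk", "update", "pass"],
    ]
    id2word = Dictionary(texts)
    corpus = [id2word.doc2bow(text) for text in texts]

    lda = LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=10,
        random_state=100,       # seed for reproducible training
        update_every=1,         # update the model after every chunk
        chunksize=50,           # number of documents per chunk
        passes=10,              # full sweeps over the corpus
        alpha='auto',           # learn an asymmetric document-topic prior from the data
        per_word_topics=True,   # keep per-word topic assignments
    )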

1 Answer

21

I wonder if you have seen this page?

Either way, let me explain a few things for you. The number of documents you are using is small for this method (it works much better when trained on a data source the size of Wikipedia), so the results will be rather crude and you have to be aware of that. This is also why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).

As for the other parameters:

  • random_state - this serves as a seed (in case you want to repeat the training process exactly)

  • chunksize - number of documents to consider at once (affects the memory consumption)

  • update_every - update the model every update_every chunks, i.e. after every update_every * chunksize documents have been processed (essentially, this is for memory consumption optimization)

  • passes - how many times the algorithm is supposed to pass over the whole corpus

  • alpha - to cite the documentation:

    can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

  • per_word_topics - setting this to True allows for extraction of the most likely topics given a word. The training process is set up in such a way that every word will be assigned to a topic; otherwise, words that are not indicative are omitted. The phi value is the per-word topic weight that steers this process - it acts as a threshold for whether a word is treated as indicative or not (see the sketch below).
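To make that last point concrete, here is a minimal sketch of querying per-word topics; it assumes the `lda`, `corpus`, and `id2word` objects from the question's setup (those names, and the `minimum_phi_value` threshold shown, are just illustrative):

    # Assumes `lda` is a trained LdaModel, and that `corpus` and `id2word`
    # are the bag-of-words corpus and Dictionary used to build it.
    bow = corpus[0]

    # get_document_topics can also return, for each word id, the topics it was
    # assigned to and the corresponding phi values (per-word topic weights).
    doc_topics, word_topics, phi_values = lda.get_document_topics(
        bow,
        per_word_topics=True,
        minimum_phi_value=0.01,   # word/topic pairs with phi below this are dropped
    )

    for word_id, topic_ids in word_topics:
        print(id2word[word_id], topic_ids)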

Optimal training process parameters are described particularly well in M. Hoffman et al., Online Learning for Latent Dirichlet Allocation.

For memory optimization of the training process or the model see this blog post.

  • Thanks for your response -- very helpful! Just to make sure I understand correctly: chunksize and update_every should not impact LDA's output or its ability to identify topics, but rather affect memory consumption? I just wanted to make sure those are two separate things. – Jane Sully Jun 12 '18 at 13:30
  • 1
    With overly rare updates you risk that the model will not capture the nuances of the data well enough. Numerically the results should be similar (a close approximation) as long as you do not deviate too far from the default values. – sophros Jun 12 '18 at 14:24
  • I did not find much difference between per_word_topics = True and per_word_topics = False. – Nitesh Jindal Jun 30 '19 at 09:18
  • @NiteshJindal - as almost everything in Machine Learning - it all depends on your dataset. – sophros Jun 30 '19 at 09:21
  • 1
    I would like to add, regarding the chunksize parameter (at least for the multicore version), that it might change the result. Have a look at https://groups.google.com/g/gensim/c/FE7_FYSconA and https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html – chAlexey Dec 02 '20 at 17:28
  • What is the difference between `random_state` and `passes`? – Ali A. Jalil Jul 15 '22 at 09:01
  • @AliA.Jalil random_state is the 'seed' of the training, while passes is the number of passes through the whole corpus. – Inaam Ilahi Aug 02 '22 at 07:37