I am using Gensim's HDP implementation to infer the topics of a dataset, but I have a question regarding the truncation level.

Is there a way to infer the most appropriate truncation level? I have noticed that the final number of topics depends on the truncation level I select.

  • I've edited your question to improve clarity and grammar. I also removed the "thanks in advance" because greetings and thanks are not needed here. – Stephen Ostermiller Oct 23 '18 at 13:32

1 Answer

As an initial workaround, I was able to use the alternative 'tomotopy' package, as described in this article, to find a number of topics without setting the 'top level truncation level' that the gensim package expects.

As for why this happens, the details go beyond my mathematical ability, but as far as I can tell from the documentation and the pages it links to, setting the truncation level is not the same as setting the number of topics. Rather, it sets an upper bound on how many topics the model is allowed to use, and the model is then expected to infer a smaller number that are actually used in the data. Happy to be corrected if others are better qualified to interpret this!
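To make the "upper bound" idea concrete, here is a minimal sketch of a truncated stick-breaking draw, which is my understanding of how the topic weights in an HDP prior are usually described. It is not gensim's inference code: the function names and the 95% mass threshold are my own assumptions, and it only samples from the prior rather than fitting any data.

```python
import random


def truncated_stick_breaking(gamma, truncation, seed=0):
    """Draw topic weights from a stick-breaking process cut off at `truncation`.

    Hypothetical helper for illustration only, not part of gensim.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of stick not yet assigned to a topic
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, gamma)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # the last topic takes whatever mass is left
    return weights


def effective_topics(weights, mass=0.95):
    """Smallest number of topics whose weights add up to at least `mass`."""
    total = 0.0
    for count, weight in enumerate(sorted(weights, reverse=True), start=1):
        total += weight
        if total >= mass:
            return count
    return len(weights)


weights = truncated_stick_breaking(gamma=1.0, truncation=150)
print(len(weights), "topics allowed,", effective_topics(weights), "carry 95% of the weight")
```

The list always has as many entries as the truncation level, but most of the weight tends to sit on a much smaller number of topics, which is the sense in which the truncation is a cap and not a topic count.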

I think this is happening for my data (and possibly for the original poster's as well) because the topics in it are not distinct enough, so the model ends up splitting into as many topics as the truncation allows.

Note that the 'top level truncation' parameter referred to here ('T' in the documentation) is set to 150 by default, so if your gensim HDP model outputs exactly 150 topics, it is quite possible that your data has the same issue.
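One rough way to check whether the model is really using all 150 topics is to count how many topics get a meaningful share of at least one document. The sketch below is only an illustration: `count_used_topics` and the 0.1 threshold are my own assumptions, and the input is hand-written sample data in the list-of-`(topic_id, probability)` shape that I believe gensim returns for `hdp[bow]`.

```python
from collections import Counter


def count_used_topics(doc_topics, min_prob=0.1):
    """Count, per topic, the documents in which it reaches `min_prob`.

    Hypothetical helper; `doc_topics` is a list with one entry per document,
    each entry being a list of (topic_id, probability) pairs.
    """
    used = Counter()
    for pairs in doc_topics:
        for topic_id, prob in pairs:
            if prob >= min_prob:
                used[topic_id] += 1
    return used


# Made-up example data, not output from a real model.
doc_topics = [
    [(0, 0.7), (3, 0.25), (9, 0.05)],
    [(0, 0.6), (1, 0.4)],
    [(3, 0.9), (9, 0.02)],
]

used = count_used_topics(doc_topics)
print(len(used), "topics used out of a truncation of T = 150")
```

With a fitted model, the input could presumably be built as `[hdp[bow] for bow in corpus]`. If the count comes out close to T, that would suggest the cap is being hit and not that the data genuinely contains that many topics.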

  • This comment thread here may also be useful: https://groups.google.com/g/gensim/c/Eqkx942kBRU – Agnes Apr 30 '23 at 10:15