
I have gone through other threads where it is specified that in LDA the memory requirement is proportional to numberOfTerms * numberOfTopics. In my case I have two datasets. In dataset A I have 250K documents and around 500K terms, and there I can easily run with ~500 topics. But in dataset B I have around 2 million documents and 500K terms (we got here after some filtering), and there I can only run with up to 50 topics; above that it throws a memory exception.
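To put rough numbers on the terms-times-topics part, here is a back-of-the-envelope sketch, assuming the topic-term statistics are held as dense float64 arrays and guessing at the number of working copies kept at once (both are assumptions on my part):

```python
# Back-of-the-envelope estimate of the dense topic-term matrices alone,
# assuming float64 (8 bytes per value) and a guessed number of working
# copies (sufficient statistics, expectations, etc.) held at once.
def topic_term_gb(num_topics, num_terms, copies=3, bytes_per_val=8):
    return num_topics * num_terms * copies * bytes_per_val / 1024**3

print(topic_term_gb(500, 500_000))  # dataset A: ~5.6 GB
print(topic_term_gb(50, 500_000))   # dataset B: ~0.6 GB
```

By this estimate dataset B should need far less memory for the topic-term part than dataset A, yet it is the one that fails.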

So I just wanted to understand: if only the number of terms and topics matters for memory, why is the number of documents causing this problem, and is there any quick workaround that can avoid it?

Note: I know the corpus can be wrapped as an iterable, as specified in memory-efficient-lda-training-using-gensim-library, but let's assume I have already loaded the corpus into memory, because of other restrictions I have around keeping the input data in a format that can be used on different platforms for different algorithms. The point is that I am able to run it for a smaller number of topics after loading the whole corpus into memory, so is there any workaround that would let it run for a larger number of topics? For example, I thought adjusting chunksize might help, but that didn't work.
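For reference, the call I am making looks roughly like the sketch below; the toy corpus here just stands in for my ~2M already-loaded bag-of-words documents, and the chunksize value is one of the ones I tried:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in for the real in-memory corpus; in my case the list holds
# ~2M documents already converted to (term_id, frequency) pairs.
texts = [["memory", "topics", "terms"], ["documents", "topics", "memory"]]
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=50,     # anything much above 50 raises the memory error on dataset B
    chunksize=2000,    # adjusting this did not help
    passes=1,
)
```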

Amit Kumar
  • The available memory for `gensim` is probably reduced by the memory occupied by your dataset B. Can you verify the memory size of your dataset by doing `sys.getsizeof(b)` (supposing that `b` is your dataset)? It may confirm that. – Jundiaius Sep 04 '18 at 17:16
  • Dataset size won't interfere with the available memory when running gensim. The dataset is processed in a separate process, and its output (only term IDs and frequencies) is what we use as the input for gensim. The only possibility is that the corpus itself, once loaded into memory (in the form gensim's LDA understands), is occupying that much memory. So maybe I can check the size of the corpus I am passing in both cases and see whether the difference is significant, right? – Amit Kumar Sep 04 '18 at 17:36
  • Just checked the corpus size; it should not be the problem, as it varies by only a few MB. – Amit Kumar Sep 04 '18 at 19:34
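(For comparing the two corpora more thoroughly than a bare `sys.getsizeof`, which only reports the size of the outer list and not the nested per-document data, a minimal sketch like the following could be used; `bow_corpus` is a stand-in name for the loaded corpus.)

```python
import sys

# sys.getsizeof(bow_corpus) counts only the outer list object, not the
# per-document lists and (term_id, freq) tuples inside it, so sum those too.
def corpus_size_bytes(bow_corpus):
    total = sys.getsizeof(bow_corpus)
    for doc in bow_corpus:
        total += sys.getsizeof(doc)
        total += sum(sys.getsizeof(pair) for pair in doc)
    return total

bow_corpus = [[(0, 2), (3, 1)], [(1, 5)]]  # toy example
print(corpus_size_bytes(bow_corpus) / 1024**2, "MB")
```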

0 Answers